Minimax Filtering via Relations Between Information and Estimation




We investigate the problem of continuous-time causal estimation under a minimax criterion. Let $X^T = \{X_t, 0 \le t \le T\}$ be governed by probability law $P_\theta$ from some class of possible laws indexed by $\theta \in \Theta$, and let $Y^T$ be the noise corrupted observations of $X^T$ available to the estimator. We characterize the estimator minimizing the worst case regret, where regret is the difference between the expected loss of the estimator and that optimized for the true law of $X^T$.

We then relate this minimax regret to the channel capacity when the channel is either Gaussian or Poisson. 
In this case, we characterize the minimax regret and the minimax estimator more explicitly. If we assume that the 
uncertainty set consists of deterministic signals, the worst case regret is exactly equal to the corresponding channel 
capacity, namely the maximal mutual information attainable across the channel among all possible distributions on 
the uncertainty set of signals, and the optimum minimax estimator is the Bayesian estimator assuming the capacity- 
achieving prior. Moreover, we show that this minimax estimator is not only minimizing the worst case regret but also 
essentially minimizing the regret for "most" of the other sources in the uncertainty set. 

We present a couple of examples for the construction of an approximately minimax filter via an approximation 
of the associated capacity achieving distribution. 

Index Terms 

Mismatched estimation, Minimax regret, Regret-capacity, Strong regret-capacity, Directed information, Sparse signal estimation, AWGN channel, Poisson channel.



Recent relations between information and estimation have shown fundamental links between the causal estimation error and information theoretic quantities. In [1], Duncan showed that the causal estimation error of a signal corrupted by additive white Gaussian noise (AWGN) is equal to the mutual information between the input and output processes divided by the signal-to-noise ratio. In [2], Weissman extended the result to the scenario of mismatched estimation, where the estimator assumes that the input signal is governed by a law $Q$ while its true law is $P$. In this case, the cost of mismatch, which is half the difference between the mismatched causal estimation error and the optimal (non-mismatched) causal estimation error, is given by the relative entropy between the laws of the output processes when the input processes have laws $P$ and $Q$, respectively. In [3], Atar et al. showed that similar information-estimation relations exist in the Poisson channel for both mismatched and non-mismatched settings.
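As a small numerical illustration of the Duncan-type relation described above, the following sketch checks it on a toy model that is not taken from the paper: a constant signal $X \sim N(0,1)$ observed through $dY_t = X\,dt + dW_t$ (unit signal-to-noise ratio). For this model the posterior variance of $X$ given $Y^t$ is $1/(1+t)$ and $I(X;Y^T) = \frac{1}{2}\ln(1+T)$ nats; the function names are hypothetical.

```python
import math

def causal_half_mse(T, steps=20_000):
    """Integrate half the causal MMSE over [0, T] for the toy model
    X ~ N(0, 1) constant in time, dY_t = X dt + dW_t (an assumed example)."""
    dt = T / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt          # midpoint rule
        mmse_t = 1.0 / (1.0 + t)    # Var(X | Y^t) for this Gaussian model
        total += 0.5 * mmse_t * dt
    return total

def mutual_information(T):
    """I(X; Y^T) in nats for the same toy model."""
    return 0.5 * math.log(1.0 + T)

if __name__ == "__main__":
    for T in (0.5, 2.0, 10.0):
        print(T, causal_half_mse(T), mutual_information(T))
```

The two columns printed for each horizon should agree up to the integration error, which is the content of the relation for half the squared error loss at unit signal-to-noise ratio.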

In this paper, we investigate the continuous-time causal estimation problem. We assume that the input process is governed by a probability law from a known uncertainty class $\mathcal{P}$, where the estimator does not know the true law. In particular, suppose that the input process is governed by law $P_\theta \in \mathcal{P}$, where $\theta \in \Theta$ and $\Theta$ is the uncertainty set known to the estimator. In this setting, it is natural to consider the minimax estimator, which minimizes the worst case regret, where regret is defined as the difference between the causal estimation error of the estimator and that of the optimal estimator. One of the main contributions of this paper is characterizing the minimax estimator, showing that it is in fact a Bayesian estimator under a distribution which is the capacity-achieving mixture of distributions associated with the channel whose input is a source in the uncertainty set.

Similar arguments appear in classical universal source coding theory. In that setting, the encoder only knows that the source is governed by some law from an uncertainty set, and the goal is to construct a universal code that minimizes the gap between its expected codelength and that under the optimum encoding strategy for the true law. More precisely, redundancy is defined as the difference between the expected length of the universal code and the expected length of the optimal code for the true (active) source distribution. Redundancy-capacity theory tells us that the minimax redundancy, which is the minimum of the worst case redundancy, coincides with the maximum mutual information between the input and output of a channel whose input is a choice of a law from the uncertainty set and whose output is a realization of that law. If the channel is either Gaussian or Poisson, we can combine the results of mismatched estimation and the above redundancy-capacity theorem in order to relate the minimax regret to the corresponding mutual information. Indeed, the minimax regret turns out to be equal to the maximum mutual information between the input index and the corresponding output, which we shall refer to as "regret capacity". Moreover, the optimal minimax filter is Bayesian with respect to the same prior that achieves the maximum mutual information. Therefore, if we know the distribution that maximizes mutual information, we can derive the optimal minimax estimator. Further, we shall see that if the class of measures $\mathcal{P}$ is a set of deterministic signals, this mutual information simplifies to the mutual information between the input and output processes $X^T$ and $Y^T$. This allows us to harness well known results from channel coding to characterize and construct the optimum minimax filter.

Since, by definition, the goal in minimax estimation is to minimize the worst case estimation regret, one possible critique is that it might not result in good estimation for many of the sources in the class. However, in universal source coding theory, Merhav and Feder [4] showed that the minimax encoder works well for "most" distributions in the uncertainty set, where "most" is measured with respect to the capacity-achieving prior, which is argued to be the "right" prior. Indeed, the framework of [4] strengthened and generalized results of this nature that were established for parametric uncertainty sets by Rissanen in [5]. We can apply this idea to our minimax estimation setting. These results imply that the minimax estimator not only minimizes the worst case error, but does essentially as well as the optimal estimator for most sources.

Our results for the Gaussian and the Poisson channel carry over to accommodate the presence of feedback, which means that the input process at time $t$, $X_t$, is also affected by previous outputs $\{Y_s : 0 \le s < t\}$. We show that the results remain valid in the presence of feedback, in some cases after substituting mutual information with the notion of directed information, as developed for continuous time in [6].

The rest of the paper is organized as follows. Section II describes the concrete problem setting. In Section III we present and discuss the main results. Section IV provides proofs of the theorems. In Sections V and VI we provide examples and simulation results. We conclude with a summary in Section VII.



II. Problem Setting 

Let the input process $X^T = \{X_t, 0 \le t \le T\}$ be governed by probability law $P_\theta$ from some class of possible laws indexed by $\theta \in \Theta$, where $\Theta$ is an uncertainty set known to the estimator. Let $Y^T$ be the noise corrupted observations of $X^T$ at the estimator; the probability law of $Y^T$ therefore also depends on the particular realization of $\theta \in \Theta$. Denote the input and reconstruction alphabets by $\mathcal{X}$ and $\hat{\mathcal{X}}$, respectively. In other words, $X_t \in \mathcal{X}$ and $\hat{X}_t \in \hat{\mathcal{X}}$, where typically both $\mathcal{X}$ and $\hat{\mathcal{X}}$ are $\mathbb{R}$ or $\mathbb{R}_+$. Let the measurable¹ function $l(\cdot,\cdot) : \mathcal{X} \times \hat{\mathcal{X}} \to [0,\infty)$ be a given loss function. For simplicity and transparency of our arguments, we assume that $\hat{\mathcal{X}}$ is a vector space and that $l(\cdot,\cdot)$ satisfies the following properties:

(P1) $l(x,\hat{x})$ is convex in $\hat{x}$;

(P2) $\min_{\hat{x} \in \hat{\mathcal{X}}} E[l(X,\hat{x})] = E[l(X,E[X])]$.

The squared error loss function and the natural loss function $l(x,\hat{x}) = x\log(x/\hat{x}) - x + \hat{x}$, introduced in [3], are examples of loss functions satisfying these properties. Cf. [7] for other loss functions of this type.
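Property (P2) says that the mean is the best constant estimate. The following sketch checks this numerically for both losses on a small discrete distribution; the support points, probabilities, and search grid are arbitrary choices made for illustration only.

```python
import math

def squared_loss(x, xhat):
    """Half the squared error, as used for the Gaussian channel."""
    return 0.5 * (x - xhat) ** 2

def natural_loss(x, xhat):
    """x*log(x/xhat) - x + xhat, defined for x, xhat > 0."""
    return x * math.log(x / xhat) - x + xhat

def expected_loss(loss, values, probs, xhat):
    """E[l(X, xhat)] for a discrete random variable X."""
    return sum(p * loss(x, xhat) for x, p in zip(values, probs))

def best_on_grid(loss, values, probs, grid):
    """Grid point with the smallest expected loss."""
    return min(grid, key=lambda g: expected_loss(loss, values, probs, g))

# Illustrative (assumed) distribution of X on three positive points.
values = [0.5, 1.0, 4.0]
probs = [0.2, 0.5, 0.3]
mean = sum(p * x for x, p in zip(values, probs))
grid = [0.5 + 0.01 * k for k in range(351)]  # 0.50, 0.51, ..., 4.00

if __name__ == "__main__":
    print("mean:", mean)
    print("squared loss minimizer:", best_on_grid(squared_loss, values, probs, grid))
    print("natural loss minimizer:", best_on_grid(natural_loss, values, probs, grid))
```

For both losses the grid minimizer should coincide with the mean up to the grid resolution.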

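Define the causal estimator $\hat{X}_t(\cdot)$ as a function of the output process up to time $t$, i.e., $Y^t = \{Y_s, 0 \le s \le t\}$, and define the causal estimation error associated with the filter $\hat{X} = \{\hat{X}_t(\cdot), 0 \le t \le T\}$ by

$$\mathrm{cmle}(\theta, \hat{X}) = E_{P_\theta}\left[\int_0^T l(X_t, \hat{X}_t(Y^t))\,dt\right], \tag{1}$$

where $E_{P_\theta}[\cdot]$ denotes expectation under $P_\theta$.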



III. Main Results 

A. Minimax Causal Estimation Criterion 

Suppose the estimator is optimized for law $Q$ while the active law is $P_\theta$. Then the estimator will employ the Bayesian estimator $E_Q$, where $E_Q = \{E_Q[X_t|\cdot] : 0 \le t \le T\}$ denotes the Bayesian filter under prior $Q$, and the corresponding mismatched causal estimation error will be

$$\mathrm{cmle}(\theta, E_Q) = E_{P_\theta}\left[\int_0^T l(X_t, E_Q[X_t|Y^t])\,dt\right] \triangleq \mathrm{cmle}_{\theta,Q}. \tag{2}$$



In particular, when the estimator is optimized for the true distribution, i.e., $Q = P_\theta$, the causal estimation error is

$$\mathrm{cmle}(\theta, E_{P_\theta}) = E_{P_\theta}\left[\int_0^T l(X_t, E_{P_\theta}[X_t|Y^t])\,dt\right] = \mathrm{cmle}_{\theta,P_\theta}, \tag{3}$$

i.e., the Bayes optimum for the source $P_\theta$.

¹ From this point on we tacitly assume measurability of all functions introduced.

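Clearly, this can be considered our benchmark because it is the minimum causal estimation error when the probability law is exactly known. Now, similar to the universal source coding problem, define the regret of the filter $\hat{X}$ when the active source is $P_\theta$ by

$$R(\theta, \hat{X}) = \mathrm{cmle}(\theta, \hat{X}) - \mathrm{cmle}_{\theta,P_\theta}. \tag{4}$$

Since $\mathrm{cmle}_{\theta,P_\theta}$ is our benchmark, it is natural to seek to minimize the worst-case regret over all possible $\theta \in \Theta$. Specifically, define $\mathrm{minimax}(\Theta)$ as

$$\mathrm{minimax}(\Theta) = \inf_{\hat{X}} \sup_{\theta \in \Theta} R(\theta, \hat{X}), \tag{5}$$

where the infimum is over all possible filters.

B. Main Results

Similar to (2), if the estimator is Bayesian under law $Q$, i.e., $\hat{X}_t = E_Q[X_t|Y^t]$, then denote the regret by

$$R(\theta, \hat{X}) \triangleq R_{\theta,Q}. \tag{6}$$

Theorem 1: Let $\mathcal{Q}$ denote the convex hull of the uncertainty set of all possible laws, i.e., $\mathcal{Q} = \mathrm{conv}(\{P_\theta : \theta \in \Theta\})$. Let $l(\cdot,\cdot)$ be a loss function with the above properties. Then

\begin{align}
\mathrm{minimax}(\Theta) &= \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Theta} R_{\theta,Q} \tag{7}\\
&= \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Theta} \{\mathrm{cmle}_{\theta,Q} - \mathrm{cmle}_{\theta,P_\theta}\}. \tag{8}
\end{align}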

Consider the following two canonical continuous-time channel models.

1) Gaussian Channel: Suppose that under all $P_\theta$, $\theta \in \Theta$, $Y^T$ is the AWGN corrupted version of $X^T$, i.e.,

$$dY_t = X_t\,dt + dW_t, \tag{9}$$

where $W^T$ is standard Brownian motion independent of $X^T$. We consider half the squared loss function, $l(x,\hat{x}) = \frac{1}{2}(x - \hat{x})^2$, where we introduce the factor $1/2$ to streamline the exposition that follows.

2) Poisson Channel: Suppose that under all $P_\theta$, $\theta \in \Theta$, $Y^T$ is a non-homogeneous Poisson process with intensity $X^T$, where $X^T$ is a stochastic process bounded by two positive constants. As in [3], we employ the natural loss function $l(x,\hat{x}) = x\log(x/\hat{x}) - x + \hat{x}$. This loss function is a natural choice for the Poisson channel, cf. [3, Lemma 2.1].

Note that in these two settings the uncertainty in $P_\theta$ is only in the distribution of $X^T$, as the channel from $X^T$ to $Y^T$ is the same regardless of $\theta$. We are now ready to state our main results.

Theorem 2 (Regret-Capacity): Let the setting be either that of the Gaussian channel or the Poisson channel. Then

$$\mathrm{minimax}(\Theta) = \sup_{w \in \mu(\Theta)} I_w(\Theta; Y^T), \tag{10}$$

where $\mu(\Theta)$ denotes the class of all possible measures on the set $\Theta$ and $I_w(\Theta; Y^T)$ denotes the mutual information between $\Theta$ and $Y^T$ when $\theta \sim w$ and the conditional law of $Y^T$ given $\theta$ is the law of $Y^T$ under $P_\theta$.

Theorem 3 (Minimax Filter): Suppose the supremum in Theorem 2 is achieved and let $w^*$ denote the achiever. Then the minimum in (8) is achieved by the Bayesian optimal filter with respect to $Q^*$, the mixture of the $P_\theta$'s with respect to $w^*$, i.e.,

$$Q^* = \int P_\theta\, w^*(d\theta), \tag{11}$$

and the minimax filter is

$$\hat{X}_t(Y^t) = E_{Q^*}[X_t|Y^t]. \tag{12}$$

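Theorem 4 (Strong Regret-Capacity): Suppose the supremum in Theorem 2 is achieved and let $w^*$ denote the achiever. For any filter $\hat{X}$ and every $\epsilon > 0$,

$$R(\theta, \hat{X}) \ge (1-\epsilon) \cdot \mathrm{minimax}(\Theta) \tag{13}$$

for all $\theta \in \Theta$ with the possible exception of points in a subset $B \subset \Theta$, where

$$w^*(B) \le e \cdot 2^{-\epsilon \cdot \mathrm{minimax}(\Theta)}. \tag{14}$$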

Consider the case of feedback, where $X_t$ is also affected by previous outputs $\{Y_s : 0 \le s < t\}$. More precisely, $X_t$ can be viewed as a function of $Y^{t-\delta}$ and $R$ for some $\delta > 0$, where $R$ is additional randomness, independent of all other processes. Let $\mathcal{P}$ be a class of joint laws of $(X^T, Y^T)$, and let $\Theta$ be a set of indices of laws. The definitions of $\mathrm{minimax}$ and $R_{\theta,Q}$ remain the same. Then the above theorems also hold, i.e.,

Theorem 5 (Presence of Feedback):

$$\mathrm{minimax}(\Theta) = \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Theta} R_{\theta,Q}. \tag{15}$$

Moreover, if the setting is either Gaussian or Poisson, then

\begin{align}
\mathrm{minimax}(\Theta) &= \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Theta} R_{\theta,Q} \tag{16}\\
&= \sup_{w} I_w(\Theta; Y^T) \tag{17}\\
&= \sup_{w} I(X^T \to Y^T) - I(X^T \to Y^T | \Theta), \tag{18}
\end{align}

where $I(X^T \to Y^T)$ is the directed information from $X^T$ to $Y^T$, as introduced in [6] and precisely defined in Section IV-A2.

C. Discussion 

Theorem 1 implies that the optimum minimax filter is a Bayesian filter under some law $Q$. Furthermore, this minimum achieving $Q$ is a mixture of $P_\theta$'s. Therefore, in order to find the optimum minimax filter, it is enough to restrict the search space to that of Bayesian filters. This is equivalent to finding an optimal prior $Q^*$, or optimum weights $w^*$ over the laws $\{P_\theta\}$. Note that we have not assumed anything about the statistics of the input and output processes, only the above mentioned properties of the loss function $l(\cdot,\cdot)$.

Theorem 2 implies that there is a strong link between the minimax regret and the communication problem, as in the theory of universal source coding. This mutual information is equal to $I(X^T;Y^T) - I(X^T;Y^T|\Theta)$, where the first term is the mutual information between input and output when the input distribution is $Q = \int_\Theta P_\theta\, w(d\theta)$. Furthermore, Theorem 3 provides a prescription for such a filter in cases where the noise corruption mechanism is either Gaussian or Poisson. Note that if the uncertainty set consists of a set that constrains the possible underlying signals rather than their laws (e.g., all signals at the channel input confined to some peak and/or power constraint), then the right hand side of (10) boils down to a supremum over all distributions on the set of allowable channel inputs, i.e.,

\begin{align}
\mathrm{minimax}(\Theta) &= \sup_{w \in \mu(\Theta)} I(X^T;Y^T) \tag{19}\\
&= \sup_{Q \in \mathcal{Q}} I(X^T;Y^T), \tag{20}
\end{align}

where $\mathcal{Q} = \mathrm{conv}(\mathcal{P})$. (19) follows because $X^T$ is deterministic given $\theta$, therefore $I(X^T;Y^T|\Theta) = 0$.

Note that the right hand side of the above equation is the capacity of the channel whose input is constrained to lie in the uncertainty set of signals at the channel input with respect to which the minimax quantity is defined. Moreover, letting $Q^*$ denote the capacity achieving distribution, the optimum minimax estimator is the Bayesian estimator with respect to the law $Q^*$. More interestingly, $Q^*$ turns out to coincide with the classical notion of the least favorable prior from estimation theory. We establish this connection in detail in Appendix I. These results show the strong relation between the minimax estimation and channel coding problems.

In Theorem 4 we can see that our optimal minimax estimator minimizes not only the worst case regret, but also the regret for most $\theta \in \Theta$ under distribution $w^*$. Cf. [4] for a discussion of the significance and implications of this result. For example, it implies that when $\Theta$ is a compact subset of $\mathbb{R}^k$ and the parametrization of the input distributions $P_\theta$ is sufficiently smooth, the minimax filter is essentially optimal not only in the worst case sense for which it was optimized, but in fact on "most" of the sources, over all possible filters (note that we are not restricting filters to be Bayesian). "Most" here means that the Lebesgue measure of the set of parameters indexing sources for which (13) fails vanishes as the value of $\mathrm{minimax}(\Theta)$ grows without bound, which is usually the case as $T$ increases in all but the most degenerate of situations.

Theorem 5 implies that the above result can be extended to the case where feedback exists. Note that if $\mathcal{P}$ is a class of deterministic laws, i.e., $X_t$ is a function of previous inputs and outputs given $\theta$, then

$$\mathrm{minimax}(\Theta) = \sup_{w} I(X^T \to Y^T). \tag{21}$$

Recall that this is $T$ times the channel capacity in the presence of feedback.



IV. Proof

A. Preliminaries

1) Redundancy Capacity Theory: In the context of universal source coding, let $x^n = (x_1, \cdots, x_n)$ be a sequence of symbols. Let $\{P_\theta : \theta \in \Theta\}$ be a set of probability laws of sequences. Define the redundancy by

$$R_n(L,\theta) = E_{P_\theta}[L(X^n)] - H_\theta(X^n), \tag{22}$$

where $L(\cdot)$ is the codeword length function of a given uniquely decodable (UD) code and $H_\theta(X^n)$ is the entropy of the sequence with respect to $P_\theta$. Then define the minimax redundancy as

$$R_n = \min_{L} \sup_{\theta \in \Theta} R_n(L,\theta). \tag{23}$$

In [8], Gallager showed that the minimax redundancy is equal to the capacity of the virtual channel whose input is $\theta \in \Theta$ and whose output is drawn according to the probability measure $P_\theta(x^n)$, i.e.,

$$R_n = C_n, \tag{24}$$

where $C_n = \sup_w I_w(\Theta; X^n)$.

Furthermore, the minimum achieving length function $L^*$ is related to the supremum achieving weights $w^*$. More precisely,

$$L^*(x^n) = -\log Q^*(x^n), \tag{25}$$

where $Q^* = \int_{\theta \in \Theta} P_\theta\, w^*(d\theta)$.

Merhav and Feder [4] proved a strong version of the redundancy-capacity theorem: for any length function $L$ of a UD code and every $\epsilon > 0$,

$$R_n(L,\theta) \ge (1-\epsilon) C_n \tag{26}$$

for all $\theta \in \Theta$ except for points in a subset $B \subset \Theta$, where

$$w^*(B) \le e \cdot 2^{-\epsilon C_n}. \tag{27}$$

Note that the choice of the probability measure $w^*$ is reasonable because it captures variety in sets (cf. Merhav and Feder [4]). As we discussed, this theorem implies that $C_n$ is not only the minimum of the worst case redundancy, but also close to the minimum redundancy for most of the other sources.

Most of the ideas in the universal source coding problem can also be applied to our setting.
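The redundancy-capacity identity (24) can be illustrated on a finite toy family. The sketch below runs a Blahut-Arimoto iteration over weights $w$ on a hypothetical family of three laws on a binary alphabet (the family, iteration count, and function names are assumptions made for this illustration) and compares the resulting mutual information with the worst-case divergence $\max_\theta D(P_\theta \| Q^*)$ from the mixture $Q^* = \sum_\theta w^*_\theta P_\theta$.

```python
import math

def kl_bits(p, q):
    """Relative entropy D(p || q) in bits for finite distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mixture(laws, w):
    """Mixture law sum_k w[k] * laws[k]."""
    m = len(laws[0])
    return [sum(w[k] * laws[k][j] for k in range(len(laws))) for j in range(m)]

def capacity_prior(laws, iters=500):
    """Blahut-Arimoto iteration over weights on a finite family of laws.
    Returns (weights, mixture, mutual information, worst-case divergence)."""
    n = len(laws)
    w = [1.0 / n] * n
    for _ in range(iters):
        q = mixture(laws, w)
        d = [kl_bits(p, q) for p in laws]
        unnorm = [w[k] * 2.0 ** d[k] for k in range(n)]
        s = sum(unnorm)
        w = [u / s for u in unnorm]
    q = mixture(laws, w)
    d = [kl_bits(p, q) for p in laws]
    capacity = sum(w[k] * d[k] for k in range(n))
    return w, q, capacity, max(d)

# Hypothetical uncertainty set: three laws on {0, 1}.
laws = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

if __name__ == "__main__":
    w, q, capacity, worst = capacity_prior(laws)
    print("weights:", w)
    print("capacity (bits):", capacity, "worst-case divergence (bits):", worst)
```

In this example the two extreme laws share the weight and the middle law receives vanishing weight, so the mutual information and the worst-case divergence both approach $1 - h(0.1)$ bits, where $h$ is the binary entropy function.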

2) Directed Information: Given two random vectors $X^n$ and $Y^n$, directed information can be defined as follows.

Definition 1 (Directed Information, Discrete-time setting):

$$I(X^n \to Y^n) \triangleq \sum_{i=1}^{n} I(X^i; Y_i | Y^{i-1}). \tag{28}$$

In [6], Weissman et al. extended this definition to the continuous-time setting, i.e., directed information between two random processes $X^T$ and $Y^T$. For a given vector $\mathbf{t} = (t_0, \cdots, t_n)$ where $0 = t_0 < t_1 < \cdots < t_n = T$, define $X_0^{T,\mathbf{t}} = (X_0^{t_1}, X_{t_1}^{t_2}, \cdots, X_{t_{n-1}}^{T})$ and treat $X_0^{T,\mathbf{t}}$ as an $n$-dimensional vector. Using this notation, we can define the directed information between two random processes.

Definition 2:

$$I(X^T \to Y^T) \triangleq \inf_{\mathbf{t}} I\left(X_0^{T,\mathbf{t}} \to Y_0^{T,\mathbf{t}}\right), \tag{29}$$

where the infimum is over all finite dimensional vectors $\mathbf{t}$.

We refer to [6] for more on the properties of directed information and its significance in communication and estimation.



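B. Proof of Theorem 1

Proof: We denote the class of measures on $\Theta$ by $\mu(\Theta)$, i.e., $w \in \mu(\Theta)$ can be viewed as a weight function over the probability distributions $P_\theta$, $\theta \in \Theta$. Then we have

\begin{align}
\mathrm{minimax}(\Theta) &= \inf_{\hat{X}} \sup_{\theta \in \Theta} R(\theta, \hat{X}) \tag{30}\\
&= \inf_{\hat{X}} \sup_{\theta \in \Theta} \left\{ E_{P_\theta}\left[\int_0^T l(X_t, \hat{X}_t(Y^t))\,dt\right] - \mathrm{cmle}_{\theta,P_\theta} \right\} \tag{31}\\
&= \inf_{\hat{X}} \sup_{w \in \mu(\Theta)} \left\{ \int_{\theta \in \Theta} E_{P_\theta}\left[\int_0^T l(X_t, \hat{X}_t(Y^t))\,dt\right] w(d\theta) - \int_{\theta \in \Theta} \mathrm{cmle}_{\theta,P_\theta}\, w(d\theta) \right\} \tag{32}\\
&= \inf_{\hat{X}} \sup_{w \in \mu(\Theta)} \left\{ E_{P_{av}}\left[\int_0^T l(X_t, \hat{X}_t(Y^t))\,dt\right] - \int_{\theta \in \Theta} \mathrm{cmle}_{\theta,P_\theta}\, w(d\theta) \right\} \tag{33}\\
&\ge \sup_{w \in \mu(\Theta)} \inf_{\hat{X}} \left\{ E_{P_{av}}\left[\int_0^T l(X_t, \hat{X}_t(Y^t))\,dt\right] - \int_{\theta \in \Theta} \mathrm{cmle}_{\theta,P_\theta}\, w(d\theta) \right\} \tag{34}\\
&= \sup_{w \in \mu(\Theta)} \left\{ E_{P_{av}}\left[\int_0^T l(X_t, E_{P_{av}}[X_t|Y^t])\,dt\right] - \int_{\theta \in \Theta} \mathrm{cmle}_{\theta,P_\theta}\, w(d\theta) \right\} \tag{35}\\
&= \sup_{w \in \mu(\Theta)} \min_{Q \in \mathcal{Q}} \left\{ E_{P_{av}}\left[\int_0^T l(X_t, E_{Q}[X_t|Y^t])\,dt\right] - \int_{\theta \in \Theta} \mathrm{cmle}_{\theta,P_\theta}\, w(d\theta) \right\} \tag{36}\\
&= \min_{Q \in \mathcal{Q}} \sup_{w \in \mu(\Theta)} \left\{ E_{P_{av}}\left[\int_0^T l(X_t, E_{Q}[X_t|Y^t])\,dt\right] - \int_{\theta \in \Theta} \mathrm{cmle}_{\theta,P_\theta}\, w(d\theta) \right\} \tag{37}\\
&= \min_{Q \in \mathcal{Q}} \sup_{w \in \mu(\Theta)} \int_{\theta \in \Theta} \left\{ E_{P_\theta}\left[\int_0^T l(X_t, E_{Q}[X_t|Y^t])\,dt\right] - \mathrm{cmle}_{\theta,P_\theta} \right\} w(d\theta) \tag{38}\\
&= \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Theta} \left\{ E_{P_\theta}\left[\int_0^T l(X_t, E_{Q}[X_t|Y^t])\,dt\right] - \mathrm{cmle}_{\theta,P_\theta} \right\} \tag{39}\\
&= \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Theta} \{\mathrm{cmle}_{\theta,Q} - \mathrm{cmle}_{\theta,P_\theta}\} \tag{40}\\
&= \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Theta} R_{\theta,Q}, \tag{41}
\end{align}

where:

• In (33), we set $P_{av} = \int P_\theta\, w(d\theta)$.

• (34) is because for any real-valued function $f(x,y)$ on $\mathcal{X} \times \mathcal{Y}$, $\inf_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} f(x,y) \ge \sup_{y \in \mathcal{Y}} \inf_{x \in \mathcal{X}} f(x,y)$.

• (35) is because the loss function $l$ satisfies property (P2), i.e., the conditional expectation minimizes the expected loss.

• (37) is because $\mu(\Theta)$ and $\mathcal{Q}$ are compact and convex subsets of linear topological spaces. Also, the quantity is convex in $Q$ and concave (in fact, linear) in $w$, thus we can apply the minimax theorem.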



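The opposite direction is trivial, since Bayesian filters form a subset of all filters, that is,

$$\inf_{\hat{X}} \sup_{\theta \in \Theta} \left\{ E_{P_\theta}\left[\int_0^T l(X_t, \hat{X}_t(Y^t))\,dt\right] - \mathrm{cmle}_{\theta,P_\theta} \right\} \le \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Theta} \left\{ E_{P_\theta}\left[\int_0^T l(X_t, E_Q[X_t|Y^t])\,dt\right] - \mathrm{cmle}_{\theta,P_\theta} \right\}. \tag{42}$$

Therefore,

$$\mathrm{minimax}(\Theta) = \inf_{\hat{X}} \sup_{\theta \in \Theta} R(\theta, \hat{X}) = \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Theta} R_{\theta,Q}. \tag{43}$$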



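C. Proof of Theorems 2 and 3

Proof: For both the Gaussian and Poisson settings, the cost of mismatch is related to the relative entropy between the outputs corresponding to input laws $P_\theta$ and $Q$, respectively [2], [3], i.e.,

$$\mathrm{cmle}_{\theta,Q} - \mathrm{cmle}_{\theta,P_\theta} = D((P_\theta)_{Y^T} \| Q_{Y^T}), \tag{44}$$

where $(P_\theta)_{Y^T}$ is the distribution of $Y^T$ when the law of the input process is $P_\theta$, and similarly for $Q_{Y^T}$. Using an argument similar to that of classical minimax redundancy theory, we get

\begin{align}
\mathrm{minimax}(\Theta) &= \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Theta} \{\mathrm{cmle}_{\theta,Q} - \mathrm{cmle}_{\theta,P_\theta}\} \tag{45}\\
&= \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Theta} D((P_\theta)_{Y^T} \| Q_{Y^T}) \tag{46}\\
&= \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Theta} \int d(P_\theta)_{Y^T} \log \frac{d(P_\theta)_{Y^T}}{dQ_{Y^T}} \tag{47}\\
&= \min_{Q \in \mathcal{Q}} \sup_{w \in \mu(\Theta)} \int\!\!\int d(P_\theta)_{Y^T} \log \frac{d(P_\theta)_{Y^T}}{dQ_{Y^T}}\, w(d\theta) \tag{48}\\
&= \sup_{w \in \mu(\Theta)} \min_{Q \in \mathcal{Q}} \int\!\!\int d(P_\theta)_{Y^T} \log \frac{d(P_\theta)_{Y^T}}{dQ_{Y^T}}\, w(d\theta) \tag{49}\\
&= \sup_{w \in \mu(\Theta)} \min_{Q \in \mathcal{Q}} \left\{ \int\!\!\int d(P_\theta)_{Y^T} \log \frac{d(P_\theta)_{Y^T}}{d(P_{av})_{Y^T}}\, w(d\theta) + \int\!\!\int d(P_\theta)_{Y^T} \log \frac{d(P_{av})_{Y^T}}{dQ_{Y^T}}\, w(d\theta) \right\} \tag{50}\\
&= \sup_{w \in \mu(\Theta)} \min_{Q \in \mathcal{Q}} \left\{ \int D((P_\theta)_{Y^T} \| (P_{av})_{Y^T})\, w(d\theta) + D((P_{av})_{Y^T} \| Q_{Y^T}) \right\} \tag{51}\\
&= \sup_{w \in \mu(\Theta)} \int D((P_\theta)_{Y^T} \| (P_{av})_{Y^T})\, w(d\theta) \tag{52}\\
&= \sup_{w \in \mu(\Theta)} I_w(\Theta; Y^T). \tag{53}
\end{align}

This completes the proof of Theorem 2.

In (52), if the supremum achieving $w^*$ exists, the minimum achieving $Q^*$ is a weighted sum of probability measures, i.e.,

$$Q^* = \int P_\theta\, w^*(d\theta). \tag{54}$$

Therefore,

$$\mathrm{minimax}(\Theta) = \sup_{\theta \in \Theta} \{\mathrm{cmle}_{\theta,Q^*} - \mathrm{cmle}_{\theta,P_\theta}\}, \tag{55}$$

which implies that the optimum minimax estimator is a Bayesian estimator based on law $Q^*$, i.e.,

$$\hat{X}_t(Y^t) = E_{Q^*}[X_t|Y^t]. \tag{56}$$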



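D. Proof of Theorem 4

Proof: The idea of the proof is similar to that in [4], except that we consider not only Bayesian estimators but also general estimators. For a given estimator $\hat{X}^*$ and $\epsilon > 0$, define the set $B = \{\theta : R(\theta, \hat{X}^*) < (1-\epsilon) \cdot \mathrm{minimax}(\Theta)\}$. Then, by definition of $B$, we have

\begin{align}
\mathrm{minimax}(B) &= \inf_{\hat{X}} \sup_{\theta \in B} R(\theta, \hat{X}) \tag{57}\\
&\le \sup_{\theta \in B} R(\theta, \hat{X}^*) \tag{58}\\
&\le (1-\epsilon) \cdot \mathrm{minimax}(\Theta). \tag{59}
\end{align}

Consider $\Theta$ as a random variable with measure $w^*$. Let $Z = 1_{\{\theta \in B\}}$ be a binary random variable, so that $P(Z = 1) = w^*(B)$. Note that $Z - \Theta - Y^T$ is a Markov chain, thus we have

\begin{align}
\mathrm{minimax}(\Theta) &= I_{w^*}(\Theta; Y^T) \tag{60}\\
&= I(Z; Y^T) + I(\Theta; Y^T | Z) \tag{61}\\
&= I(Z; Y^T) + P(Z=1) I(\Theta; Y^T | Z=1) + P(Z=0) I(\Theta; Y^T | Z=0) \tag{62}\\
&\le I(Z; Y^T) + w^*(B) \cdot \mathrm{minimax}(B) + (1 - w^*(B)) \cdot \mathrm{minimax}(\Theta) \tag{63}\\
&\le H(Z) + \left((1-\epsilon) w^*(B) + 1 - w^*(B)\right) \cdot \mathrm{minimax}(\Theta). \tag{64}
\end{align}

Since $P(Z = 1) = w^*(B)$, we have

$$-\log w^*(B) - \frac{1 - w^*(B)}{w^*(B)} \log(1 - w^*(B)) \ge \epsilon \cdot \mathrm{minimax}(\Theta), \tag{65}$$

which implies

$$w^*(B) \le e \cdot 2^{-\epsilon \cdot \mathrm{minimax}(\Theta)}. \tag{66}$$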
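The last step of the proof, from (65) to (66), rests on the elementary fact that $-\frac{1-w}{w}\log_2(1-w) \le \log_2 e$ for $0 < w < 1$. The sketch below checks this on a grid and evaluates the resulting bound on $w^*(B)$; the function names and the sample values of $\epsilon$ and of the regret capacity are illustrative assumptions.

```python
import math

def entropy_term(w):
    """-(1-w)/w * log2(1-w): the second term on the left of (65), in bits."""
    return -(1.0 - w) / w * math.log2(1.0 - w)

def exceptional_mass_bound(eps, regret_capacity_bits):
    """Right-hand side of (66), e * 2^(-eps * minimax), capped at 1."""
    return min(1.0, math.e * 2.0 ** (-eps * regret_capacity_bits))

if __name__ == "__main__":
    worst = max(entropy_term(k / 1000.0) for k in range(1, 1000))
    print("max of entropy term on grid:", worst, "log2(e):", math.log2(math.e))
    for capacity in (10.0, 100.0, 1000.0):
        print(capacity, exceptional_mass_bound(0.1, capacity))
```

With the entropy term bounded by $\log_2 e$, (65) gives $-\log_2 w^*(B) + \log_2 e \ge \epsilon \cdot \mathrm{minimax}(\Theta)$, which rearranges to (66). The printed bounds show how quickly the exceptional set shrinks as the regret capacity grows.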
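E. Proof of Theorem 5

Proof: The proofs of Theorem 1 and Theorem 4 remain valid in this case. Moreover, the cost of mismatch result also holds in the presence of feedback [3]. All we have to prove is the last part of the theorem, which is the analogue of Theorem 2.

Recall the definition of directed information in the continuous-time setting. For fixed time instants $0 = t_0 < t_1 < t_2 < \cdots < t_n = T$,

\begin{align}
I(\Theta; Y^T) &= \sum_{i=1}^{n} I(\Theta; Y_{t_{i-1}}^{t_i} | Y^{t_{i-1}}) \tag{67}\\
&= \sum_{i=1}^{n} \int \log \frac{dP_{Y_{t_{i-1}}^{t_i} | Y^{t_{i-1}}, \Theta}}{dP_{Y_{t_{i-1}}^{t_i} | Y^{t_{i-1}}}}\, dP_{Y^{t_i}, \Theta} \tag{68}\\
&= \sum_{i=1}^{n} \int \left[ \log \frac{dP_{Y_{t_{i-1}}^{t_i} | X^{t_i}, Y^{t_{i-1}}, \Theta}}{dP_{Y_{t_{i-1}}^{t_i} | Y^{t_{i-1}}}} - \log \frac{dP_{Y_{t_{i-1}}^{t_i} | X^{t_i}, Y^{t_{i-1}}, \Theta}}{dP_{Y_{t_{i-1}}^{t_i} | Y^{t_{i-1}}, \Theta}} \right] dP_{X^{t_i}, Y^{t_i}, \Theta} \tag{69}\\
&= \sum_{i=1}^{n} \left[ \int \log \frac{dP_{Y_{t_{i-1}}^{t_i} | X^{t_i}, Y^{t_{i-1}}}}{dP_{Y_{t_{i-1}}^{t_i} | Y^{t_{i-1}}}}\, dP_{X^{t_i}, Y^{t_i}} - \int \log \frac{dP_{Y_{t_{i-1}}^{t_i} | X^{t_i}, Y^{t_{i-1}}, \Theta}}{dP_{Y_{t_{i-1}}^{t_i} | Y^{t_{i-1}}, \Theta}}\, dP_{X^{t_i}, Y^{t_i}, \Theta} \right] \tag{70}\\
&= \sum_{i=1}^{n} I(Y_{t_{i-1}}^{t_i}; X^{t_i} | Y^{t_{i-1}}) - I(Y_{t_{i-1}}^{t_i}; X^{t_i} | Y^{t_{i-1}}, \Theta), \tag{71}
\end{align}

where (70) is because $\Theta - (X^{t_i}, Y^{t_{i-1}}) - Y_{t_{i-1}}^{t_i}$ forms a Markov chain. Since the equality holds for any choice of time instants, by taking the limit $\sup_i |t_i - t_{i-1}| \to 0$, we can argue that

\begin{align}
\mathrm{minimax}(\Theta) &= \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Theta} R_{\theta,Q} \tag{72}\\
&= \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Theta} D((P_\theta)_{Y^T} \| Q_{Y^T}) \tag{73}\\
&= \sup_{w} I_w(\Theta; Y^T) \tag{74}\\
&= \sup_{w} I(X^T \to Y^T) - I(X^T \to Y^T | \Theta). \tag{75}
\end{align}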



V. Examples 

A. Gaussian Channel and Sparse Signal 

We first apply the above theorems to the problem of sparse signal estimation under Gaussian noise.

1) Setting: We assume the output process $Y^T$ is an AWGN corrupted version of $X^T$, as discussed in Section III-B1, while the input process $X^T$ is sparse in the sense explained below. Recall that we use half the squared error as the distortion measure, $l(x,\hat{x}) = \frac{1}{2}(x - \hat{x})^2$.

Let $\{\phi_i(t), 0 \le t \le T\}_{i=1}^{n}$ be a given orthonormal signal set. Suppose $X^T$ is a linear combination of the $\phi_i(t)$'s, i.e., $X_t = \sum_{i=1}^{n} A_i \phi_i(t)$, where $\{A_i\}_{i=1}^{n}$ are random variables with unknown distribution. However, we assume that the estimator knows that the signal $X^T$ is power constrained and sparse, by which we mean that the fraction of non-zero elements in $\{A_i\}$ should not exceed $q$ (i.e., at most $nq$ of the $A_i$'s can be nonzero). Let $\mathcal{P}$ be the class of all possible probability measures $P_\theta$ of the vector $A^n = (A_1, \cdots, A_n)$, indexed by $\theta$, which satisfy these two constraints, i.e.,

$$\mathcal{P} = \left\{ P_\theta : P_\theta\left(\frac{1}{n}\sum_{i=1}^{n} A_i^2 \le P\right) = 1,\; P_\theta\left(\frac{1}{n}\sum_{i=1}^{n} 1_{\{A_i \ne 0\}} \le q\right) = 1 \right\}. \tag{76}$$

Note that $\int_0^T X_t^2\,dt = \sum_{i=1}^{n} A_i^2$ because of the orthonormality of the basis; therefore, it is equivalent to consider $\frac{1}{n}\sum_{i=1}^{n} A_i^2 \le P$ as a power constraint. Define the uncertainty set $\Theta$ as the set of such indices. It is clear that $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is a convex set.

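We further define $\mathcal{P}_D$ and $\mathcal{P}_{av}$ in a similar manner:

$$\mathcal{P}_D = \left\{ P_\theta : P_\theta(A^n = a^n) = 1 \text{ for some } a^n \text{ such that } \frac{1}{n}\sum_{i=1}^{n} a_i^2 \le P,\; \frac{1}{n}\sum_{i=1}^{n} 1_{\{a_i \ne 0\}} \le q \right\}, \tag{77}$$

$$\mathcal{P}_{av} = \left\{ P_\theta : E_{P_\theta}\left[\frac{1}{n}\sum_{i=1}^{n} A_i^2\right] \le P,\; E_{P_\theta}\left[\frac{1}{n}\sum_{i=1}^{n} 1_{\{A_i \ne 0\}}\right] \le q \right\}. \tag{78}$$

We can understand $\mathcal{P}_D$ as a class of deterministic measures, and $\mathcal{P}_{av}$ as a class of measures that satisfy the average power and sparsity constraints in expectation, while measures in $\mathcal{P}$ satisfy the constraints with probability 1. Also, define the corresponding sets of indices as $\Theta_D$ and $\Theta_{av}$, respectively. There are some simple relations among these sets.

• $\mathcal{P}_D \subset \mathcal{P} \subset \mathcal{P}_{av}$ and $\Theta_D \subset \Theta \subset \Theta_{av}$.

• $\mathcal{P}$ is the convex closure of $\mathcal{P}_D$, i.e., $\mathcal{P} = \mathrm{conv}(\mathcal{P}_D)$.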

2) Apply the Theorem: Theorem 2 implies that

$$\mathrm{minimax}(\Theta) = \sup_{w(\cdot) \in \mu(\Theta)} I(X^T;Y^T) - I(X^T;Y^T|\Theta). \tag{79}$$

Since our optimum causal minimax estimator is the Bayesian estimator under the distribution $Q^* = \int P_\theta\, w^*(d\theta)$, where $w^*$ is the supremum achiever, we are interested in $w^*$. Rather than maximizing the difference between mutual informations, we can find an equivalent problem which is much easier to handle by exploiting the relation between $\mathrm{minimax}(\Theta)$ and $\mathrm{minimax}(\Theta_D)$.

Lemma 6:

$$\mathrm{minimax}(\Theta_D) = \mathrm{minimax}(\Theta). \tag{80}$$

The proof is given in Appendix II. Since $\mathcal{P}_D$ is a set of deterministic measures, we can get a more explicit formula for $\mathrm{minimax}(\Theta_D)$, as we showed in Section III-C:

\begin{align}
\mathrm{minimax}(\Theta) &= \mathrm{minimax}(\Theta_D) \tag{81}\\
&= \sup_{w(\cdot) \in \mu(\Theta_D)} I(X^T;Y^T) \tag{82}\\
&= \sup_{P_\theta \in \mathcal{P}} I(X^T;Y^T). \tag{83}
\end{align}

Since $X^T$ is governed by the law $\int P_\theta\, w(d\theta)$, it is equivalent to maximize the mutual information over all possible mixture laws instead of finding the optimum measure on $\Theta_D$. Moreover, the minimum achiever $Q^*$ of $\mathrm{minimax}(\Theta_D)$ coincides with that of $\mathrm{minimax}(\Theta)$. Thus, it is enough to consider $\mathrm{minimax}(\Theta_D)$, which is much simpler to solve.

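Now, consider $\mathrm{minimax}(\Theta_{av})$:

\begin{align}
\mathrm{minimax}(\Theta) &= \min_{Q \in \mathcal{P}} \sup_{\theta \in \Theta} \mathrm{cmle}_{\theta,Q} - \mathrm{cmle}_{\theta,P_\theta} \tag{84}\\
&= \min_{Q \in \mathcal{P}_{av}} \sup_{\theta \in \Theta} \mathrm{cmle}_{\theta,Q} - \mathrm{cmle}_{\theta,P_\theta} \tag{85}\\
&\le \min_{Q \in \mathcal{P}_{av}} \sup_{\theta \in \Theta_{av}} \mathrm{cmle}_{\theta,Q} - \mathrm{cmle}_{\theta,P_\theta} \tag{86}\\
&= \mathrm{minimax}(\Theta_{av}) \tag{87}\\
&= \sup_{w(\cdot) \in \mu(\Theta_{av})} I(X^T;Y^T) - I(X^T;Y^T|\Theta), \tag{88}
\end{align}

where (85) is because the Bayesian estimator with prior $Q^* \in \mathcal{P}$ is optimum over all possible filters; therefore we do not have to restrict $Q$ to the class $\mathcal{P}$, and $Q^*$ is always the minimum achiever.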



3) Sufficient Statistics: Since the channel input signal is a linear combination of orthonormal signals, sufficient statistics of the channel output signal are its projections on the $\phi_i$'s, i.e., $\left\{\int_0^T \phi_i(t)\,dY_t\right\}_{i=1}^{n}$. Therefore, the above mutual information $I(X^T;Y^T)$ can be further simplified as

$$\mathrm{minimax}(\Theta) = \sup_{P_\theta \in \mathcal{P}} I(A^n;B^n), \tag{89}$$

where $B_i = \int_0^T \phi_i(t)\,dY_t$ for $1 \le i \le n$. Since we assumed an orthonormal basis, $B^n$ can be viewed as the output of a discrete-time additive white Gaussian channel, i.e., $B_i = A_i + W_i$, where the $W_i$ are i.i.d. standard Gaussian noise and independent of $A^n$. This implies that our problem of maximizing the mutual information over the continuous-time channel is equivalent to maximizing the mutual information between $n$ channel inputs and $n$ channel outputs over the AWGN channel, with the input distribution constrained as in (76).

Recall that the above result shows that sufficient statistics for estimating $X_t$ given $Y^T$ are the projections $\left\{\int_0^T \phi_i(s)\,dY_s\right\}_{i=1}^{n}$; in other words, the following Markov relation holds:

$$X_t - \left\{\int_0^T \phi_i(s)\,dY_s\right\}_{i=1}^{n} - Y^T. \tag{90}$$

Similarly, the following lemma shows that $\left\{\int_0^t \phi_i(s)\,dY_s\right\}_{i=1}^{n}$ are sufficient statistics for estimating $X_t$ given $Y^t$.

Lemma 7: The following Markov relation holds for all $t \in [0,T]$:

$$X_t - \left\{\int_0^t \phi_i(s)\,dY_s\right\}_{i=1}^{n} - Y^t. \tag{91}$$

The proof of this lemma is given in Appendix III.

4) Gaussian Channel with Sparsity Constraint: The problem $\sup_{P_\theta \in \mathcal{P}_{av}} I(A^n;B^n)$ was recently considered by Zhang and Guo in [9], where they referred to it as "Gaussian channels with duty cycle and power constraints". They have shown that the distribution on $A^n$ that maximizes this mutual information is i.i.d. and discrete. In other words, letting $P_D$ denote the distribution on $A$ that maximizes $I(A;B)$, when $B = A + W$ for a standard Gaussian noise $W$ which is independent of $A$, among all distributions constrained by $E[A^2] \le P$ and $P(A \ne 0) \le q$, their results imply that $P_D$ is discrete and, when combined with (78), imply that

$$\sup_{P_\theta \in \mathcal{P}_{av}} I(A^n;B^n) = n\,[I(A;B)]_{P_A = P_D}. \tag{92}$$

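5) Bayesian Estimator: Let $Q^*$ be the minimum achieving law of $\mathrm{minimax}(\Theta)$, so that the optimum causal minimax estimator is a Bayesian estimator assuming the prior $Q^*$, i.e.,

$$\hat{X}_t(Y^t) = E_{Q^*}[X_t|Y^t]. \tag{93}$$

This conditional expectation is hard to compute in general; however, we know sufficient statistics which allow us to implement the estimator in a practical sense.

Let us first define the following terms:

\begin{align}
\mathbf{Y}(t) &= (Y_1(t), Y_2(t), \cdots, Y_n(t))^T \text{ where } Y_i(t) = \int_0^t \phi_i(s)\,dY_s \tag{94}\\
\mathbf{W}(t) &= (W_1(t), W_2(t), \cdots, W_n(t))^T \text{ where } W_i(t) = \int_0^t \phi_i(s)\,dW_s \tag{95}\\
\mathbf{X}(t) &= (X_1(t), X_2(t), \cdots, X_n(t))^T \text{ where } X_i(t) = \int_0^t \phi_i(s) X_s\,ds = \sum_{j=1}^{n} A_j \left(\int_0^t \phi_i(s)\phi_j(s)\,ds\right) \tag{96}\\
\Gamma(t) &= n \text{ by } n \text{ matrix where } (\Gamma(t))_{ij} = \int_0^t \phi_i(s)\phi_j(s)\,ds. \tag{97}
\end{align}

Note that $\mathbf{W}(t)$ is Gaussian with zero mean and covariance matrix $\Gamma(t)$. This is because

\begin{align}
E[W_i(t) W_j(t)] &= E\left[\int_0^t\!\!\int_0^t \phi_i(s)\phi_j(u)\,dW_s\,dW_u\right] \tag{98}\\
&= \int_0^t \phi_i(s)\phi_j(s)\,ds. \tag{99}
\end{align}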
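To make the matrix $\Gamma(t)$ in (97) concrete, the following sketch evaluates it by numerical integration for a hypothetical orthonormal pair on $[0,T]$ with $T = 1$: a constant function and a cosine. The basis, the horizon, and the midpoint-rule step count are assumptions chosen for illustration. At $t = T$ the matrix is the identity by orthonormality, while for $t < T$ the off-diagonal entries are generally nonzero.

```python
import math

T = 1.0  # assumed horizon for this illustration

def phi(i, t):
    """Hypothetical orthonormal pair on [0, T]: a constant and a cosine."""
    if i == 0:
        return 1.0 / math.sqrt(T)
    return math.sqrt(2.0 / T) * math.cos(2.0 * math.pi * t / T)

def gram(t, n=2, steps=20_000):
    """Gamma(t)_{ij} = int_0^t phi_i(s) phi_j(s) ds via the midpoint rule."""
    ds = t / steps
    g = [[0.0] * n for _ in range(n)]
    for k in range(steps):
        s = (k + 0.5) * ds
        vals = [phi(i, s) for i in range(n)]
        for i in range(n):
            for j in range(n):
                g[i][j] += vals[i] * vals[j] * ds
    return g

if __name__ == "__main__":
    print("Gamma(T)   =", gram(T))
    print("Gamma(T/4) =", gram(T / 4.0))
```

For this pair, $\Gamma(T/4)$ has diagonal entries $1/4$ and off-diagonal entries $\sqrt{2}/(2\pi)$, so the noise components $W_i(t)$ are correlated at intermediate times even though the basis is orthonormal over the full interval.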
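From Lemma 7, for fixed $t$, the causal estimation problem is reduced to the following vector estimation problem:

$$\mathbf{Y}(t) = \mathbf{X}(t) + \mathbf{W}(t) = \Gamma(t)\mathbf{A} + \mathbf{W}(t), \tag{100}$$

where $\mathbf{A} = A^n = (A_1, \cdots, A_n)^T$ and $\mathbf{W}(t) \sim \mathcal{N}(0, \Gamma(t))$, and the corresponding Bayesian estimator will be

\begin{align}
\hat{X}_t(Y^t) &= E_{Q^*}[X_t|Y^t] \tag{101}\\
&= \sum_{i=1}^{n} E_{Q^*}[A_i|\mathbf{Y}(t)]\,\phi_i(t). \tag{102}
\end{align}

If $\Gamma(t)$ is invertible, this problem is simple. Even if $\Gamma(t)$ is not invertible, it is still symmetric and the problem can be reduced as follows. Suppose the eigenvalue decomposition of the matrix $\Gamma(t)$ is $\Gamma(t) = V(t)\Lambda(t)V(t)^T$, where $V(t) = [v_1(t), \cdots, v_n(t)]$ is an orthonormal matrix and $\Lambda(t) = \mathrm{diag}(\lambda_1(t), \lambda_2(t), \cdots, \lambda_n(t))$ with $0 \le \lambda_1(t) \le \lambda_2(t) \le \cdots \le \lambda_n(t)$. We can rewrite the problem as

$$V(t)^T \mathbf{Y}(t) = \Lambda(t) V(t)^T \mathbf{A} + V(t)^T \mathbf{W}(t). \tag{103}$$

Note that $V(t)^T \mathbf{W}(t) \sim \mathcal{N}(0, \Lambda(t))$. Let $m$ be the number of zero eigenvalues, i.e., $\lambda_1(t) = \cdots = \lambda_m(t) = 0 < \lambda_{m+1}(t)$. Clearly, the first $m$ elements can be removed; therefore we can define the effective parts of the matrices as

\begin{align}
V_{\mathrm{eff}}(t) &= [v_{m+1}(t), \cdots, v_n(t)] \tag{104}\\
\Lambda_{\mathrm{eff}}(t) &= \mathrm{diag}(\lambda_{m+1}(t), \cdots, \lambda_n(t)). \tag{105}
\end{align}

Therefore, the above vector estimation problem can be further simplified as

\begin{align}
V_{\mathrm{eff}}(t)^T \mathbf{Y}(t) &= \Lambda_{\mathrm{eff}}(t) V_{\mathrm{eff}}(t)^T \mathbf{A} + V_{\mathrm{eff}}(t)^T \mathbf{W}(t) \tag{106}\\
\Lambda_{\mathrm{eff}}(t)^{-1/2} V_{\mathrm{eff}}(t)^T \mathbf{Y}(t) &= \Lambda_{\mathrm{eff}}(t)^{1/2} V_{\mathrm{eff}}(t)^T \mathbf{A} + \Lambda_{\mathrm{eff}}(t)^{-1/2} V_{\mathrm{eff}}(t)^T \mathbf{W}(t). \tag{107}
\end{align}

Note that $\Lambda_{\mathrm{eff}}(t)^{-1/2} V_{\mathrm{eff}}(t)^T \mathbf{W}(t) \sim \mathcal{N}(0, I_{n-m})$.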

6) Almost Optimal Causal Minimax Estimator: Combining with Lemma 7, we have the formula for the optimal causal minimax estimator $\hat{X}_t(Y^t) = E_{Q^*}[X_t|Y^t] = E_{Q^*}[X_t|\mathbf{Y}(t)]$. Since $E_{Q^*}[X_t|\mathbf{Y}(t)] = \sum_{i=1}^{n} E_{Q^*}[A_i|\mathbf{Y}(t)]\,\phi_i(t)$, it is enough to have the posterior distribution of $\mathbf{A}$. However, it is hard to find a maximum achieving distribution in some cases; indeed, most of the problems of finding a capacity achieving distribution are still open, including the sparse signal estimation problem that we are looking at. Therefore, we will use an approximate version of the prior, $\tilde{Q}$, so that we can easily implement the filter. One natural choice of $\tilde{Q}$ is the capacity achieving distribution of $\sup_{P_\theta \in \mathcal{P}_{av}} I(A^n;B^n)$, which is i.i.d. $P_D$. The question is then how the performance of this alternative filter compares to that of the optimum minimax filter, i.e.,

$$L(\Theta, \tilde{Q}) = \sup_{\theta \in \Theta} R_{\theta,\tilde{Q}} - \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Theta} R_{\theta,Q}. \tag{108}$$

The following lemma gives an upper bound on $L(\Theta, \tilde{Q})$.

Lemma 8: For the particular choice of $\tilde{Q}$ stated above,

$$L(\Theta, \tilde{Q}) \le [I(A^n;B^n)]_{P_{A^n} = \tilde{Q}} - [I(A^n;B^n)]_{P_{A^n} = Q^*}. \tag{109}$$

The proof is given in Appendix IV. This result implies that if these two mutual informations are close enough, then the worst case error of the alternative Bayesian filter with prior $\tilde{Q}$ is close to our benchmark, which is $\mathrm{minimax}(\Theta)$. Since $\tilde{Q}$ is i.i.d. $P_D$, the first term of the upper bound is $[I(A^n;B^n)]_{P_{A^n} = \tilde{Q}} = n[I(A;B)]_{P_A = P_D}$. Therefore, it is enough to argue that $n[I(A;B)]_{P_A = P_D} - [I(A^n;B^n)]_{P_{A^n} = Q^*}$ is small enough. The following lemma suggests that the above two mutual informations are close for large $n$.
Lemma 9: 

WuY n[I{A]B)]p - sup /(A";B") = (110) 
The proof is given in Appendix V. Finally, we get the close-to-optimal filter $E_Q[X_t \mid Y^t]$.

B. Poisson Channel and Direct Current Signal

Consider direct current (DC) signal estimation over the Poisson channel. The input process is $X_t = X$ for all $0 \le t \le T$, where $X$ is a random variable bounded as $a \le X \le A$ and $a$, $A$ are positive constants. We can define the uncertainty set $\Theta$ such that $\{P_\theta : \theta \in \Theta\}$ is the set of all possible probability measures on $X$ under which $a \le X \le A$ almost surely. The estimator observes a Poisson process with rate $X_t$, and performance is measured under the natural log loss function $l(x, \hat{x}) = x\log(x/\hat{x}) - x + \hat{x}$.

Similar to the previous section, we can define $\Theta_D$ and prove minimax($\Theta$) = minimax($\Theta_D$). Also, since $\{P_\theta : \theta \in \Theta\}$ is convex and since $Y_T$ is a sufficient statistic of $Y^T$ for $X^T$ (which is constant at $X$), we have

\[ \text{minimax}(\Theta) = \text{minimax}(\Theta_D) \tag{111} \]
\[ = \sup I(X^T; Y^T) \tag{112} \]
\[ = \sup I(X; Y_T), \tag{113} \]

where the maximization is over all distributions on $X$ supported on $[a, A]$. The corresponding communication problem is that of the capacity of the discrete-time Poisson channel, where the input is a non-negative, real-valued $X$ with a peak power constraint $a \le X \le A$ a.s. and the output is a Poisson random variable with parameter $TX$. In this scenario, Shamai [10] showed that the capacity-achieving distribution is discrete with a finite number of mass points. Let $P_s$ be this capacity-achieving distribution. Although analytic expressions for $P_s$ and the capacity of the channel are still open, we can approximate the distribution numerically to arbitrary precision.

Using Theorem 3, we can conclude that the optimum minimax causal estimator is the conditional expectation of $X$ given $Y_t$ with respect to the distribution $P_s$, i.e.,

\[ \hat{X}_t(Y^t) = E_{P_s}[X \mid Y_t]. \tag{114} \]
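One way to approximate the capacity-achieving prior numerically is to restrict $X$ to a fixed grid on $[a, A]$ and run the Blahut-Arimoto iteration on the resulting discrete channel. The sketch below is an illustration under that assumption, not the procedure used in the paper; the grid size, the truncation `y_max` of the Poisson output, and the iteration count are arbitrary choices made here.

```python
import numpy as np
from math import lgamma

def poisson_channel(rates, y_max):
    """Row i is the Poisson(rates[i]) pmf on {0, ..., y_max}, renormalised after truncation."""
    y = np.arange(y_max + 1)
    log_fact = np.array([lgamma(k + 1.0) for k in y])
    logp = y[None, :] * np.log(rates[:, None]) - rates[:, None] - log_fact[None, :]
    W = np.exp(logp)
    return W / W.sum(axis=1, keepdims=True)

def kl_rows(W, q):
    """D(W[i, :] || q) for every input symbol i, in nats."""
    return np.sum(W * (np.log(W + 1e-300) - np.log(q + 1e-300)), axis=1)

def mutual_information(p, W):
    return float(p @ kl_rows(W, p @ W))

def blahut_arimoto(W, iters=2000):
    """Blahut-Arimoto iteration started from the uniform prior on the grid."""
    p = np.full(W.shape[0], 1.0 / W.shape[0])
    for _ in range(iters):
        p = p * np.exp(kl_rows(W, p @ W))
        p /= p.sum()
    return p

def posterior_mean(prior, xs, t, y_t):
    """E[X | Y_t = y_t] for a prior supported on the grid xs (Poisson count with mean xs * t)."""
    logw = np.log(prior + 1e-300) + y_t * np.log(xs * t) - xs * t
    w = np.exp(logw - logw.max())
    return float(xs @ w / w.sum())

a, A, T = 0.5, 2.0, 10.0
xs = np.linspace(a, A, 61)                 # hypothetical grid of candidate mass points
W = poisson_channel(T * xs, y_max=80)
prior = blahut_arimoto(W)
capacity = mutual_information(prior, W)    # approximate minimax regret, in nats
mi_uniform = mutual_information(np.full(len(xs), 1.0 / len(xs)), W)
x_hat = posterior_mean(prior, xs, t=4.0, y_t=5)
```

The resulting `prior` typically concentrates on a few grid points, in line with the discreteness result cited above, and `posterior_mean` then plays the role of (114) on the grid.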
VI. Experiments 

A. Gaussian Channel and Sparse Signal 



Consider the setting of Section V-A. In order to compare the performance of the suggested minimax filter, we introduce some possible estimators. One naive choice of estimator is the maximum likelihood (ML) estimator. Recalling (107), the ML estimate of the vector $\mathbf{A}$ is given by

\[ \hat{\mathbf{A}} = \left(\Lambda_{\mathrm{eff}}(t)^{1/2}V_{\mathrm{eff}}(t)^T\right)^{+} \Lambda_{\mathrm{eff}}(t)^{-1/2}V_{\mathrm{eff}}(t)^T\mathbf{Y}(t) \tag{115} \]

where $X^{+}$ is the Moore-Penrose pseudoinverse of the matrix $X$.

Moreover, using the side information that the vector $\mathbf{A}$ is sparse, we can further apply a soft/hard thresholding technique to improve the estimate. For example, we can keep only the largest $nq$ elements of $\hat{\mathbf{A}}$, or discard elements which are smaller than a certain threshold.
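The thresholding variants mentioned above can be sketched as follows. This is an illustrative reading, assuming that "largest" refers to magnitude; the function names are hypothetical.

```python
import numpy as np

def keep_largest(a_hat, k):
    """Keep the k entries of largest magnitude and zero out the rest."""
    out = np.zeros_like(a_hat)
    if k > 0:
        idx = np.argsort(np.abs(a_hat))[-k:]
        out[idx] = a_hat[idx]
    return out

def hard_threshold(a_hat, tau):
    """Zero out entries whose magnitude is below tau."""
    return np.where(np.abs(a_hat) >= tau, a_hat, 0.0)

def soft_threshold(a_hat, tau):
    """Shrink every entry towards zero by tau."""
    return np.sign(a_hat) * np.maximum(np.abs(a_hat) - tau, 0.0)

a_hat = np.array([0.1, -3.0, 2.0, 0.5])
top2 = keep_largest(a_hat, 2)
hard = hard_threshold(a_hat, 1.0)
soft = soft_threshold(a_hat, 1.0)
```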

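Another estimator to which it is meaningful to compare is the minimax estimator that lacks the sparsity information. Since the estimator does not know that the signal is sparse, it assumes the uncertainty set is $\mathcal{P}_{ls} = \{P_\theta : P_\theta(\tfrac{1}{n}\|\mathbf{A}\|_2^2 \le P) = 1\}$. Using ideas similar to those in the previous section, we can relate this minimax optimization problem to the channel coding problem on the Gaussian channel with an average power constraint. Moreover, we can find the almost optimum minimax filter, which is Bayesian with an i.i.d. Gaussian prior, i.e., $\mathbf{A} \sim \mathcal{N}(0, PI_n)$. Note that this filter turns out to be linear, which is easy to implement. Using the result of the previous section, we have

\[ \Lambda_{\mathrm{eff}}(t)^{-1/2}V_{\mathrm{eff}}(t)^T\mathbf{Y}(t) = \Lambda_{\mathrm{eff}}(t)^{1/2}V_{\mathrm{eff}}(t)^T\mathbf{A} + \Lambda_{\mathrm{eff}}(t)^{-1/2}V_{\mathrm{eff}}(t)^T\mathbf{W}(t). \tag{116} \]

Since all components are Gaussian, we can easily compute the conditional expectation. Recall that $\mathbf{A} \sim \mathcal{N}(0, PI_n)$, so $\Lambda_{\mathrm{eff}}(t)^{-1/2}V_{\mathrm{eff}}(t)^T\mathbf{Y}(t) \sim \mathcal{N}(0, P\Lambda_{\mathrm{eff}}(t) + I_{n-m})$. Therefore,

\[ E[\mathbf{A} \mid \Lambda_{\mathrm{eff}}(t)^{-1/2}V_{\mathrm{eff}}(t)^T\mathbf{Y}(t)] = P\left(\Lambda_{\mathrm{eff}}(t)^{1/2}V_{\mathrm{eff}}(t)^T\right)^T \left(P\Lambda_{\mathrm{eff}}(t) + I_{n-m}\right)^{-1} \Lambda_{\mathrm{eff}}(t)^{-1/2}V_{\mathrm{eff}}(t)^T\mathbf{Y}(t) \tag{117} \]

\[ = P\,V_{\mathrm{eff}}(t)\left(P\Lambda_{\mathrm{eff}}(t) + I_{n-m}\right)^{-1}V_{\mathrm{eff}}(t)^T\mathbf{Y}(t). \tag{118} \]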
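As a sanity check, the linear filter can be evaluated in both forms and compared with the generic Gaussian posterior mean $P H^T (P H H^T + I)^{-1} z$ for an observation $z = H\mathbf{A} + \text{noise}$ with $H = \Lambda_{\mathrm{eff}}^{1/2}V_{\mathrm{eff}}^T$. The sketch below uses a random rank-deficient matrix as a stand-in for $\Gamma(t)$; the dimensions and the value of $P$ are arbitrary assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, P = 5, 2.0
B = rng.standard_normal((n, 3))
gamma = B @ B.T                              # rank-3 PSD stand-in for Gamma(t) (assumption)

lam, V = np.linalg.eigh(gamma)
keep = lam > 1e-10
V_eff, lam_eff = V[:, keep], lam[keep]

y = gamma @ rng.standard_normal(n)           # some observation vector Y(t)
z = (V_eff.T @ y) / np.sqrt(lam_eff)         # whitened observation, left side of (116)
H = np.sqrt(lam_eff)[:, None] * V_eff.T      # Lambda_eff^{1/2} V_eff^T

I = np.eye(len(lam_eff))
a_117 = P * H.T @ np.linalg.solve(P * np.diag(lam_eff) + I, z)   # form (117)
a_118 = P * V_eff @ ((V_eff.T @ y) / (P * lam_eff + 1.0))        # form (118)
a_lmmse = P * H.T @ np.linalg.solve(P * H @ H.T + I, z)          # generic Gaussian posterior mean
```

All three vectors agree up to rounding, because $HH^T = \Lambda_{\mathrm{eff}}$ and diagonal matrices commute.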

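Now, consider the genie-aided scheme, which allows additional information about the source. Suppose the decoder knows the positions of the nonzeros $i_1, \dots, i_k$. Then, this scheme should work better than all other schemes. Using the idea of the previous section again, the conditional expectation assuming an i.i.d. $\mathcal{N}(0, nP/k)$ prior (over the nonzero positions) is close to optimum, i.e., $\mathbf{A}_{\mathrm{nonzero}} \sim \mathcal{N}(0, \tfrac{nP}{k}I_k)$, where $\mathbf{A}_{\mathrm{nonzero}}$ is the vector of nonzero elements of $\mathbf{A}$. Using the result of the previous section again,

\[ \Lambda_{\mathrm{eff}}(t)^{-1/2}V_{\mathrm{eff}}(t)^T\mathbf{Y}(t) = \Lambda_{\mathrm{eff}}(t)^{1/2}V_{\mathrm{eff}}(t)^T\mathbf{A} + \Lambda_{\mathrm{eff}}(t)^{-1/2}V_{\mathrm{eff}}(t)^T\mathbf{W}(t). \tag{119} \]

Let $U_{\mathrm{eff}}$ be the matrix consisting of the columns of $\Lambda_{\mathrm{eff}}(t)^{1/2}V_{\mathrm{eff}}(t)^T$ that coincide with the nonzero positions of $\mathbf{A}$. Then we can rewrite the equation as

\[ \Lambda_{\mathrm{eff}}(t)^{-1/2}V_{\mathrm{eff}}(t)^T\mathbf{Y}(t) = U_{\mathrm{eff}}\mathbf{A}_{\mathrm{nonzero}} + \Lambda_{\mathrm{eff}}(t)^{-1/2}V_{\mathrm{eff}}(t)^T\mathbf{W}(t). \tag{120} \]

Under this prior, $\Lambda_{\mathrm{eff}}(t)^{-1/2}V_{\mathrm{eff}}(t)^T\mathbf{Y}(t) \sim \mathcal{N}(0, \tfrac{nP}{k}U_{\mathrm{eff}}U_{\mathrm{eff}}^T + I_{n-m})$. Therefore,

\[ E[\mathbf{A}_{\mathrm{nonzero}} \mid \Lambda_{\mathrm{eff}}(t)^{-1/2}V_{\mathrm{eff}}(t)^T\mathbf{Y}(t)] = \frac{nP}{k}\,U_{\mathrm{eff}}^T\left(\frac{nP}{k}\,U_{\mathrm{eff}}U_{\mathrm{eff}}^T + I_{n-m}\right)^{-1}\Lambda_{\mathrm{eff}}(t)^{-1/2}V_{\mathrm{eff}}(t)^T\mathbf{Y}(t). \tag{121} \]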
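The genie-aided estimate is again a Gaussian posterior mean, now restricted to the known support. A minimal sketch follows, with a random matrix `H` standing in for $\Lambda_{\mathrm{eff}}^{1/2}V_{\mathrm{eff}}^T$ and a hypothetical support; it also checks the result against the equivalent form $(U^TU + \sigma^{-2}I)^{-1}U^Tz$ with $\sigma^2 = nP/k$.

```python
import numpy as np

def genie_estimate(H, z, support, P):
    """Posterior mean of the nonzero coefficients under an i.i.d. N(0, nP/k) prior on the support."""
    n, k = H.shape[1], len(support)
    var = n * P / k
    U = H[:, support]                        # columns at the known nonzero positions
    return var * U.T @ np.linalg.solve(var * U @ U.T + np.eye(H.shape[0]), z)

rng = np.random.default_rng(1)
H = rng.standard_normal((6, 7))              # stand-in for Lambda_eff^{1/2} V_eff^T (assumption)
support = [1, 4]                             # hypothetical nonzero positions
a_nonzero = np.array([1.5, -0.7])
z = H[:, support] @ a_nonzero + rng.standard_normal(6)

est = genie_estimate(H, z, support, P=2.0)

# Equivalent form obtained from the matrix inversion lemma.
U = H[:, support]
var = 7 * 2.0 / 2
alt = np.linalg.solve(U.T @ U + np.eye(2) / var, U.T @ z)
```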

Similar to [9], we approximate $P_d$ with a finite number of mass points. Initially, we find the optimized mutual information for three mass points, then increase the number of mass points until the increment of the optimized mutual information is smaller than a prescribed small threshold. Using the approximated version of $P_d$, we compare the performance of the estimators in Figure 1. Here we set $n = 7$, $k = 2$, $P = 10^{0.4}$ (4 dB), and use the Haar basis as the orthonormal signal set. We generate random sparse coefficients and take an average of the causal squared error over 100 simulations. When we generate the random coefficients, we first choose the $n - k$ zero coefficients randomly, and draw the $k$ non-zero coefficients according to a Gaussian distribution. Note that the signals are randomly generated, so the causal errors in the above experiments are not worst-case errors; however, we can check that the minimax estimator outperforms the maximum likelihood estimators and the minimax estimator without sparsity knowledge. Note also that the performance of the minimax estimator is comparable to that of the genie-aided estimator, although the genie-aided estimator has much more powerful additional information.

Fig. 1: Plots of cmle for the experiment of Section VI-A. Here we have taken $T = 10$. $X_t$ is randomly generated according to a Gaussian distribution 100 times and we computed the average causal loss for each filter.



B. Poisson Channel and DC Signal 

For comparison, we present some other natural estimators. First, as in the previous section, we can employ the ML estimator, i.e.,

\[ \hat{X}_{ML}(Y^t) = \arg\max_x P(Y_t \mid X = x). \tag{122} \]

Note that the conditional distribution is $P(Y_t \mid X = x) = e^{-xt}(xt)^{Y_t}/Y_t!$, which is maximized at $x = Y_t/t$. Since the estimator knows that $X$ is bounded as $a \le X \le A$, the ML estimator can be written as

\[ \hat{X}_{ML}(Y^t) = \min\{\max\{a, Y_t/t\}, A\}. \tag{123} \]
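The clipped ML rule and the log loss used to score it are simple enough to state directly in code. The following is a small sketch; the function names are illustrative.

```python
import math

def ml_estimate(y_t, t, a, A):
    """Clipped maximum-likelihood estimate of a constant Poisson rate from the count y_t at time t."""
    return min(max(a, y_t / t), A)

def log_loss(x, x_hat):
    """Natural log loss l(x, x_hat) = x log(x / x_hat) - x + x_hat for positive arguments."""
    return x * math.log(x / x_hat) - x + x_hat

low = ml_estimate(0, 1.0, 0.5, 2.0)      # no arrivals yet: clipped up to a
mid = ml_estimate(12, 10.0, 0.5, 2.0)    # empirical rate inside [a, A]
high = ml_estimate(30, 10.0, 0.5, 2.0)   # empirical rate above A: clipped down
```

The loss vanishes when the estimate equals the true rate and is positive otherwise, so clipping to $[a, A]$ also keeps `log_loss` finite when no arrivals have been observed.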

Fig. 2: Plots of cmle for the experiment of Section VI-B (curves: ML, minimax, uniform prior). Here we have taken $T = 10$. $X_t$ is randomly generated according to a uniform distribution 100 times and we computed the average causal loss for each filter.

Another possible estimator is a Bayesian estimator, assuming $X$ has a uniform distribution, i.e., $X \sim U[a, A]$. In this case, the optimum Bayesian estimator is the posterior mean, which is readily obtained explicitly and given by

\[ \hat{X}_{\mathrm{unif}}(Y^t) = E[X \mid Y_t] = \frac{\int_a^A x^{Y_t+1}e^{-xt}\,dx}{\int_a^A x^{Y_t}e^{-xt}\,dx}. \tag{124} \]
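Under a uniform prior on $[a, A]$ the posterior of $X$ given $Y_t$ is proportional to $x^{Y_t}e^{-xt}$ on that interval, so the posterior mean can be evaluated by one-dimensional quadrature. The sketch below assumes the Bayesian estimator is the posterior mean and uses a hand-written trapezoidal rule; the grid size is an arbitrary choice.

```python
import numpy as np

def uniform_prior_estimate(y_t, t, a, A, num=20001):
    """Posterior mean of X under X ~ U[a, A] given a Poisson count y_t with mean X * t."""
    x = np.linspace(a, A, num)
    logw = y_t * np.log(x) - t * x           # log posterior density up to a constant
    w = np.exp(logw - logw.max())            # subtract the maximum for numerical stability
    w[0] *= 0.5                              # trapezoidal end-point weights
    w[-1] *= 0.5
    return float(np.sum(x * w) / np.sum(w))

a, A, t = 0.5, 2.0, 1.0
est0 = uniform_prior_estimate(0, t, a, A)

# Closed form for y_t = 0, where the posterior is a truncated exponential.
num_cf = (a + 1 / t) * np.exp(-t * a) - (A + 1 / t) * np.exp(-t * A)
den_cf = np.exp(-t * a) - np.exp(-t * A)
closed_form = num_cf / den_cf
```

For $Y_t = 0$ the numerical value can be compared with the truncated-exponential closed form, which gives a quick check of the quadrature.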



Figure 2 shows numerical results for the case $a = 0.5$, $A = 2$. We take an average of the causal mean loss error over 100 runs for $X = 0.5, 1, 1.5, 2$ and find the worst-case error. Compared to the Bayesian estimator with uniform prior, the minimax estimator shows much better performance.



VII. Conclusions 

We considered minimax estimation, focusing on the case of causal estimation when the noise-free object is a continuous-time signal governed by a law from a given uncertainty set. We showed that the optimum minimax filter is a Bayesian filter if the distortion criterion satisfies certain properties. We also characterized the worst-case regret and the minimax estimator in the case of Gaussian and Poisson channels by relating them to a familiar communication problem of maximizing mutual information. Using the idea of the strong redundancy/regret-capacity theorem, we showed that our minimax estimator is optimal in a sense much stronger than the one it was designed to optimize for. Using these results, we presented two examples: sparse signal estimation in the Gaussian setting and DC signal estimation in the Poisson setting, for which we used our results to derive and implement the minimax filter and exhibit its favorable performance in practice.

Our estimation framework can be extended to and applied in many other estimation problems. One possible extension is to apply Theorem 5 to stochastic learning problems of the type considered by Bento et al. in [11]. In this setting, the process $Y^T$ is defined by the stochastic differential equation $dY_t = F(Y_t; A)\,dt + dW_t$, where $A$ is an unknown random parameter and $W^T$ is standard Brownian motion. We can set $X_t = F(Y_t; A)$ and consider our estimation framework with feedback. We can apply our framework to estimate $X^T$ in the minimax sense of the present paper and, through that, learn $A$. It will be interesting to investigate how an estimator guided by this approach would compare to that in [11].



Appendix I
Least Favorable Input

Suppose $S$ is a class of possible input signals with corresponding index class $\Theta$, i.e., $S = \{f_\theta\}_{\theta \in \Theta}$. Let $P_\theta$ be a deterministic measure such that $P_\theta(f_\theta) = 1$. The input process $X_t$ is equal to $f_\theta(t)$ for some $\theta \in \Theta$ which is unknown to the filter. Instead of the minimax criterion that we discussed so far, we can consider the same problem in a Bayesian setting, namely where the input signal $\{X_t, 0 \le t \le T\}$ is governed by a probability law defined on $S$. The goal is to find the least favorable input distribution $Q \in \mu(S)$, which causes the greatest average loss (rather than regret). We refer to [12, Chapter 5] for a similar concept in point estimation theory. Define the average loss when the input distribution is $Q$, with optimum Bayesian estimator $E_Q[X_t \mid Y^t]$, as

\[ r_Q = \int_\Theta \mathrm{cmle}_{\theta,Q}\,dQ(\theta). \]

Note that $\mathrm{cmle}_{\theta,P_\theta} = 0$ since the input process is deterministic under $P_\theta$ and, therefore, the regret and the loss itself are the same in this case, i.e.,

\[ R(\theta, \hat{X}) = \mathrm{cmle}(\theta, \hat{X}) - \mathrm{cmle}_{\theta,P_\theta} = \mathrm{cmle}(\theta, \hat{X}). \]

In this setting, the minimax estimator can be viewed as an achiever of $\min_{\hat{X}} \sup_{\theta \in \Theta} \mathrm{cmle}(\theta, \hat{X})$. More formally, we define the least favorable prior as follows.

Definition 3: A prior distribution $Q$ is least favorable if $r_Q \ge r_{Q'}$ for all prior distributions $Q'$.

The relation between the minimax estimator and the least favorable input is characterized in the following theorem.

Theorem 10: Suppose that $Q^*$ is a distribution on $S$ such that

\[ r_{Q^*} = \sup_{\theta \in \Theta} \mathrm{cmle}_{\theta,Q^*}. \]

Then:

1) $E_{Q^*}[X_t \mid \cdot]$ is a minimax estimator.

2) If $E_{Q^*}[X_t \mid \cdot]$ is the unique minimizer of $\min_{\hat{X}} E_{Q^*}[\mathrm{cmle}(\theta, \hat{X})]$, then it is the unique minimax estimator.

3) $Q^*$ is least favorable.

Proof:

1) For any estimator $\hat{X}$,

\[ \sup_{\theta \in \Theta} \mathrm{cmle}(\theta, \hat{X}) \ge \int \mathrm{cmle}(\theta, \hat{X})\,dQ^*(\theta) \tag{125} \]
\[ \ge \int \mathrm{cmle}_{\theta,Q^*}\,dQ^*(\theta) \tag{126} \]
\[ = r_{Q^*} \tag{127} \]
\[ = \sup_{\theta \in \Theta} \mathrm{cmle}_{\theta,Q^*}. \tag{128} \]

2) Under the uniqueness assumption, inequality (126) is strict unless $\hat{X}$ coincides with $E_{Q^*}[X_t \mid \cdot]$, which implies uniqueness of the minimax estimator.

3) For any prior $Q'$,

\[ r_{Q'} = E_{Q'}[\mathrm{cmle}_{\theta,Q'}] \tag{129} \]
\[ \le E_{Q'}[\mathrm{cmle}_{\theta,Q^*}] \tag{130} \]
\[ \le \sup_{\theta \in \Theta} \mathrm{cmle}_{\theta,Q^*} \tag{131} \]
\[ = r_{Q^*}. \tag{132} \]


Theorem 11: If $Q^*$ is a capacity-achieving prior of the channel when the input is restricted to the set $S$, then $Q^*$ is a least favorable input.

Proof:

\[ \min_{Q \in \mu(S)} \sup_{\theta \in \Theta} \mathrm{cmle}_{\theta,Q} = \sup_{Q \in \mu(S)} I(X^T; Y^T). \]

Since $Q^*$ achieves both the minimum in $\min_{Q \in \mu(S)} \sup_{\theta \in \Theta} \mathrm{cmle}_{\theta,Q}$ and the supremum in $\sup_{Q \in \mu(S)} I(X^T; Y^T)$,

\[ \sup_{\theta \in \Theta} \mathrm{cmle}_{\theta,Q^*} = [I(X^T; Y^T)]_{X^T \sim Q^*} \tag{133} \]
\[ = E_{Q^*}[\mathrm{cmle}_{\theta,Q^*}], \tag{134} \]

where (134) is due to the I-mmse relation. This result tells us that $Q^*$ satisfies the condition of Theorem 10; therefore, the capacity-achieving prior is the least favorable input. ■

Appendix II
Proof of Lemma 6



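\begin{align}
\text{minimax}(\Theta_D) &= \min_{Q \in \mathrm{conv}(\mathcal{P}_D)} \sup_{\theta \in \Theta_D} R_{\theta,Q} \tag{135} \\
&= \min_{Q} \sup_{\theta \in \Theta_D} R_{\theta,Q} \tag{136} \\
&\le \min_{Q} \sup_{\theta \in \Theta} R_{\theta,Q} \tag{137} \\
&= \text{minimax}(\Theta). \tag{138}
\end{align}

On the other hand,

\begin{align}
\text{minimax}(\Theta) &= \min_{Q} \sup_{\theta \in \Theta} R_{\theta,Q} \tag{139} \\
&= \min_{Q} \sup_{\theta \in \Theta} E_{P_\theta}\left[\int_0^T l(X_t, E_Q[X \mid Y^t]) - l(X_t, E_{P_\theta}[X \mid Y^t])\,dt\right] \tag{140} \\
&\le \min_{Q} \sup_{\theta \in \Theta} E_{P_\theta}\left[\int_0^T l(X_t, E_Q[X \mid Y^t])\,dt\right] \tag{141} \\
&= \min_{Q} \sup_{\theta \in \Theta} \int E\left[\int_0^T l(X_t, E_Q[X \mid Y^t])\,dt \,\Big|\, A^n = a^n\right] dP_\theta(a^n) \tag{142} \\
&\le \min_{Q} \sup_{a^n \in \mathcal{T}^{(n)}} E\left[\int_0^T l(X_t, E_Q[X \mid Y^t])\,dt \,\Big|\, A^n = a^n\right] \tag{143} \\
&= \min_{Q} \sup_{\theta \in \Theta_D} E_{P_\theta}\left[\int_0^T l(X_t, E_Q[X \mid Y^t])\,dt\right] \tag{144} \\
&= \min_{Q} \sup_{\theta \in \Theta_D} E_{P_\theta}\left[\int_0^T l(X_t, E_Q[X \mid Y^t]) - l(X_t, E_{P_\theta}[X \mid Y^t])\,dt\right] \tag{145} \\
&= \min_{Q} \sup_{\theta \in \Theta_D} R_{\theta,Q} \tag{146} \\
&= \text{minimax}(\Theta_D), \tag{147}
\end{align}

where $\mathcal{T}^{(n)} = \{a^n \in \mathbb{R}^n : \tfrac{1}{n}\sum_{i=1}^n a_i^2 \le P,\ \tfrac{1}{n}\sum_{i=1}^n \mathbf{1}(a_i \ne 0) \le q\}$ is the set of vectors $a^n$ that satisfy the constraints. These two inequalities imply minimax($\Theta$) = minimax($\Theta_D$). Indeed, $\sup_{\theta \in \Theta} R_{\theta,Q} = \sup_{\theta \in \Theta_D} R_{\theta,Q}$ holds for any $Q$ in general.

Appendix III
Proof of Lemma 7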
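Proof: At time $t$, all the information that the estimator can get is $Y^t$, and it can be approximated as

\[
\begin{bmatrix} Y_{t/N} \\ Y_{2t/N} - Y_{t/N} \\ \vdots \\ Y_{Nt/N} - Y_{(N-1)t/N} \end{bmatrix}
\approx \frac{1}{N}
\begin{bmatrix}
\phi_1(0) & \phi_2(0) & \cdots & \phi_n(0) \\
\phi_1(t/N) & \phi_2(t/N) & \cdots & \phi_n(t/N) \\
\vdots & \vdots & & \vdots \\
\phi_1((N-1)t/N) & \phi_2((N-1)t/N) & \cdots & \phi_n((N-1)t/N)
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}
+
\begin{bmatrix} W_{t/N} \\ W_{2t/N} - W_{t/N} \\ \vdots \\ W_{Nt/N} - W_{(N-1)t/N} \end{bmatrix}
\tag{148}
\]

which we can rewrite as $\bar{Y} = \tfrac{1}{N}\Phi A + \bar{W}$, where $\bar{W} \sim \mathcal{N}(0, \tfrac{1}{N}I_N)$. Furthermore, $\int_0^t \phi_i(s)\,dY_s$ can be approximated as

\[ \sum_{k=1}^{N} \phi_i((k-1)t/N)\left(Y_{kt/N} - Y_{(k-1)t/N}\right). \tag{149} \]

This approximation is similar to the idea behind Ito's integral, and it is enough to prove the lemma based on this approximation. Therefore, the lemma holds if and only if $p(A \mid \bar{Y}) = p(A \mid \Phi^T\bar{Y})$ for all $\bar{Y}$. This is equivalent to $p(\bar{Y} \mid A)/p(\Phi^T\bar{Y} \mid A)$ being constant (independent of the choice of $A$) for all $\bar{Y}$. Throughout the proof, we assume $\Phi^T\Phi$ is invertible; however, it is not difficult to derive a similar result when $\Phi^T\Phi$ is not invertible.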



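\begin{align}
p(\bar{Y} \mid A) &= p\left(\bar{W} = \bar{Y} - \tfrac{1}{N}\Phi A\right) \tag{150} \\
&= \frac{1}{(2\pi/N)^{N/2}} \exp\left(-\frac{N}{2}\left(\bar{Y} - \tfrac{1}{N}\Phi A\right)^T\left(\bar{Y} - \tfrac{1}{N}\Phi A\right)\right) \tag{151} \\
&= \frac{1}{(2\pi/N)^{N/2}} \exp\left(-\frac{N}{2}\left[\bar{Y}^T\bar{Y} - \tfrac{2}{N}\bar{Y}^T\Phi A + \tfrac{1}{N^2}A^T\Phi^T\Phi A\right]\right). \tag{152}
\end{align}

On the other hand, since $\Phi^T\bar{Y}$ given $A$ is Gaussian with mean $\tfrac{1}{N}\Phi^T\Phi A$ and covariance $\tfrac{1}{N}\Phi^T\Phi$,

\begin{align}
p(\Phi^T\bar{Y} \mid A) &= \frac{1}{(2\pi)^{n/2}\det(\tfrac{1}{N}\Phi^T\Phi)^{1/2}} \exp\left(-\frac{N}{2}\left(\Phi^T\bar{Y} - \tfrac{1}{N}\Phi^T\Phi A\right)^T(\Phi^T\Phi)^{-1}\left(\Phi^T\bar{Y} - \tfrac{1}{N}\Phi^T\Phi A\right)\right) \tag{153} \\
&= \frac{1}{(2\pi)^{n/2}\det(\tfrac{1}{N}\Phi^T\Phi)^{1/2}} \exp\left(-\frac{N}{2}\left[\bar{Y}^T\Phi(\Phi^T\Phi)^{-1}\Phi^T\bar{Y} - \tfrac{2}{N}\bar{Y}^T\Phi A + \tfrac{1}{N^2}A^T\Phi^T\Phi A\right]\right). \tag{154}
\end{align}

Thus,

\begin{align}
\frac{p(\bar{Y} \mid A)}{p(\Phi^T\bar{Y} \mid A)} &= \frac{(2\pi)^{n/2}\det(\tfrac{1}{N}\Phi^T\Phi)^{1/2}}{(2\pi/N)^{N/2}} \exp\left(-\frac{N}{2}\bar{Y}^T\bar{Y} + \frac{N}{2}\bar{Y}^T\Phi(\Phi^T\Phi)^{-1}\Phi^T\bar{Y}\right) \tag{155} \\
&= \frac{(2\pi)^{n/2}\det(\tfrac{1}{N}\Phi^T\Phi)^{1/2}}{(2\pi/N)^{N/2}} \exp\left(-\frac{N}{2}\bar{Y}^T\left(I_N - \Phi(\Phi^T\Phi)^{-1}\Phi^T\right)\bar{Y}\right). \tag{156}
\end{align}

Therefore, the fraction $p(\bar{Y} \mid A)/p(\Phi^T\bar{Y} \mid A)$ is independent of the choice of $A$. This completes the proof of the lemma. ■

Appendix IV
Proof of Lemma 8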
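The sufficiency argument can be checked numerically for the discretized model $\bar{Y} = \tfrac{1}{N}\Phi A + \bar{W}$ with $\bar{W} \sim \mathcal{N}(0, \tfrac{1}{N}I_N)$: the log-ratio of the two densities should not change with $A$. The sketch below uses a random matrix in place of the sampled basis functions, which is an assumption made purely for illustration.

```python
import numpy as np

def log_density_ratio(Y, Phi, A):
    """log p(Y | A) - log p(Phi^T Y | A) for Y = Phi A / N + W, W ~ N(0, I_N / N)."""
    N, n = Phi.shape
    r = Y - Phi @ A / N
    log_p_y = -0.5 * N * np.log(2 * np.pi / N) - 0.5 * N * (r @ r)
    G = Phi.T @ Phi                          # Phi^T Y | A ~ N(G A / N, G / N)
    s = Phi.T @ r
    _, logdet = np.linalg.slogdet(G / N)
    log_p_s = (-0.5 * n * np.log(2 * np.pi) - 0.5 * logdet
               - 0.5 * N * (s @ np.linalg.solve(G, s)))
    return log_p_y - log_p_s

rng = np.random.default_rng(2)
N, n = 50, 3
Phi = rng.standard_normal((N, n))            # stand-in for sampled basis functions (assumption)
Y = rng.standard_normal(N)
ratio_1 = log_density_ratio(Y, Phi, rng.standard_normal(n))
ratio_2 = log_density_ratio(Y, Phi, rng.standard_normal(n))
```

The two values agree up to rounding for any pair of coefficient vectors, which is the statement that $\Phi^T\bar{Y}$ is a sufficient statistic for $A$ in the discretized model.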

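Proof: Let us define the class of all deterministic laws $\mathcal{P}_{D,\mathrm{all}} = \{P_\theta : P_\theta(a^n) = 1 \text{ for some } a^n \in \mathbb{R}^n\}$ with corresponding index set $\Theta_{D,\mathrm{all}}$, and the class of measures on $\Theta_{D,\mathrm{all}}$ with an additional constraint, $\mu_{D,av} = \{w \in \mu(\Theta_{D,\mathrm{all}}) : \int P_\theta\,w(d\theta) \in \mathcal{P}_{av}\}$. Writing $Q_w = \int P_\theta\,w(d\theta)$ for the mixture induced by $w$, we have

\begin{align}
\min_{Q' \in \mathcal{P}_{av}} \sup_{w \in \mu_{D,av}} \int D(P_\theta \| Q')\,w(d\theta)
&= \min_{Q' \in \mathcal{P}_{av}} \sup_{w \in \mu_{D,av}} \left[\int D(P_\theta \| Q_w)\,w(d\theta) + D(Q_w \| Q')\right] \tag{157} \\
&= \sup_{w \in \mu_{D,av}} \min_{Q' \in \mathcal{P}_{av}} \left[\int D(P_\theta \| Q_w)\,w(d\theta) + D(Q_w \| Q')\right] \tag{158} \\
&= \sup_{w \in \mu_{D,av}} \int D(P_\theta \| Q_w)\,w(d\theta) \tag{159} \\
&= \sup_{w \in \mu_{D,av}} I(\Theta; B^n) \tag{160} \\
&= \sup_{w \in \mu_{D,av}} I(A^n; B^n) \tag{161} \\
&= \sup_{P_{A^n} \in \mathcal{P}_{av}} I(A^n; B^n) \tag{162} \\
&= [I(A^n; B^n)]_{P_{A^n} = Q}. \tag{163}
\end{align}

Therefore, we can conclude that $Q$ achieves the minimum in $\min_{Q' \in \mathcal{P}_{av}} \sup_{w \in \mu_{D,av}} \int D(P_\theta \| Q')\,w(d\theta)$, i.e.,

\[ \sup_{w \in \mu_{D,av}} \int D(P_\theta \| Q)\,w(d\theta) = [I(A^n; B^n)]_{P_{A^n} = Q}. \tag{164} \]

On the other hand,

\begin{align}
\sup_{\theta \in \Theta} D(P_\theta \| Q) &= \sup_{\theta \in \Theta_D} D(P_\theta \| Q) \tag{165} \\
&= \sup_{w \in \mu(\Theta_D)} \int D(P_\theta \| Q)\,w(d\theta) \tag{166} \\
&\le \sup_{w \in \mu_{D,av}} \int D(P_\theta \| Q)\,w(d\theta) \tag{167} \\
&= [I(A^n; B^n)]_{P_{A^n} = Q}. \tag{168}
\end{align}

Therefore, we can bound $L(\Theta, Q)$:

\begin{align}
L(\Theta, Q) &= \sup_{\theta \in \Theta} R_{\theta,Q} - \min_{Q'} \sup_{\theta \in \Theta} R_{\theta,Q'} \tag{169} \\
&\le [I(A^n; B^n)]_{P_{A^n} = Q} - [I(A^n; B^n)]_{P_{A^n} = Q^*}. \tag{170}
\end{align}

Appendix V
Proof of Lemma 9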

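Proof: It is trivial that $\sup_{w \in \mu(\Theta)} I(A^n; B^n) \le n[I(A; B)]_{P_A = P_d}$ for all $n$. Therefore, it is enough to find an upper bound on $n[I(A; B)]_{P_A = P_d} - \sup_{w \in \mu(\Theta)} I(A^n; B^n)$ that converges to 0 as $n$ increases. Recall that $\sup_{w \in \mu(\Theta)} I(A^n; B^n)$ is equivalent to $\sup_{P_{A^n} \in \mathcal{P}} I(A^n; B^n)$.

Let the probability law $P_{d,\epsilon}$ be a capacity-achieving distribution of the Gaussian channel with power constraint $P - \epsilon$ and duty cycle constraint $q - \epsilon$. In other words, $P_{d,\epsilon}$ is a supremum achiever of

\[ \sup_{\substack{E[A^2] \le P - \epsilon \\ P(A \ne 0) \le q - \epsilon}} I(A; B), \tag{171} \]

where $B$ is the output of the standard Gaussian channel. Denote by $Q_P$ the projection of $P^n_{d,\epsilon}$ on $\mathcal{P}$, i.e.,

\[ Q_P(a^n) = \begin{cases} \dfrac{1}{1 - p_\epsilon^{(n)}}\,P^n_{d,\epsilon}(a^n) & \text{if } a^n \in \mathcal{T}_\epsilon^{(n)} \\ 0 & \text{otherwise} \end{cases} \tag{172} \]

where $\mathcal{T}_\epsilon^{(n)} = \{a^n \in \mathbb{R}^n : P^n_{d,\epsilon}(a^n) \ne 0,\ \tfrac{1}{n}\sum_{i=1}^n a_i^2 \le P,\ \tfrac{1}{n}\sum_{i=1}^n \mathbf{1}(a_i \ne 0) \le q\}$ is the set of vectors $a^n$ that satisfy the constraints with margin $\epsilon > 0$. Alternatively, let $\mathcal{N}_\epsilon^{(n)} = \{a^n \in \mathbb{R}^n : P^n_{d,\epsilon}(a^n) \ne 0\} \setminus \mathcal{T}_\epsilon^{(n)}$, namely the set of point masses that are not in $\mathcal{T}_\epsilon^{(n)}$. Recall that $P_{d,\epsilon}$ is discrete; therefore, both $Q_P$ and $P^n_{d,\epsilon}$ are probability mass functions. It is clear that $Q_P \in \mathcal{P}$ and $Q_P(a^n) = P^n_{d,\epsilon}(a^n \mid A^n \in \mathcal{T}_\epsilon^{(n)})$. Denote $p_\epsilon^{(n)} = P^n_{d,\epsilon}(A^n \notin \mathcal{T}_\epsilon^{(n)}) = P^n_{d,\epsilon}(A^n \in \mathcal{N}_\epsilon^{(n)})$, so that $Q_P(a^n) = \tfrac{1}{1 - p_\epsilon^{(n)}}P^n_{d,\epsilon}(a^n)\mathbf{1}(a^n \in \mathcal{T}_\epsilon^{(n)})$. By the law of large numbers, $p_\epsilon^{(n)}$ vanishes exponentially as $n$ increases. Denote by $Q_P(b^n)$ and $P^n_{d,\epsilon}(b^n)$ the output distributions of $B^n$ when the input law is $Q_P$ and $P^n_{d,\epsilon}$, respectively. Then,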

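\begin{align}
[I(A^n; B^n)]_{P_{A^n} = P^n_{d,\epsilon}} &- \sup_{P_{A^n} \in \mathcal{P}} I(A^n; B^n) \nonumber \\
&\le [I(A^n; B^n)]_{P_{A^n} = P^n_{d,\epsilon}} - [I(A^n; B^n)]_{P_{A^n} = Q_P} \tag{173} \\
&= [h(B^n)]_{P_{A^n} = P^n_{d,\epsilon}} - [h(B^n)]_{P_{A^n} = Q_P} \tag{174} \\
&= \int Q_P(b^n)\log Q_P(b^n) - P^n_{d,\epsilon}(b^n)\log P^n_{d,\epsilon}(b^n)\,db^n \tag{175} \\
&= D(Q_P(b^n) \| P^n_{d,\epsilon}(b^n)) + \int \left(Q_P(b^n) - P^n_{d,\epsilon}(b^n)\right)\log P^n_{d,\epsilon}(b^n)\,db^n \tag{176} \\
&\le -\log(1 - p_\epsilon^{(n)}) - \int \left(P^n_{d,\epsilon}(b^n) - Q_P(b^n)\right)\log P^n_{d,\epsilon}(b^n)\,db^n \tag{177} \\
&\le -\log(1 - p_\epsilon^{(n)}) - \int \left(\frac{1}{1 - p_\epsilon^{(n)}}P^n_{d,\epsilon}(b^n) - Q_P(b^n)\right)\log P^n_{d,\epsilon}(b^n)\,db^n. \tag{178}
\end{align}

Note that

\begin{align}
Q_P(b^n) &= \sum_{a^n} \frac{1}{1 - p_\epsilon^{(n)}}P^n_{d,\epsilon}(a^n)\mathbf{1}(a^n \in \mathcal{T}_\epsilon^{(n)})P(b^n \mid a^n) \tag{179} \\
&\le \frac{1}{1 - p_\epsilon^{(n)}}\sum_{a^n} P^n_{d,\epsilon}(a^n)P(b^n \mid a^n) \tag{180} \\
&= \frac{1}{1 - p_\epsilon^{(n)}}P^n_{d,\epsilon}(b^n), \tag{181}
\end{align}

which implies that $\tfrac{1}{1 - p_\epsilon^{(n)}}P^n_{d,\epsilon}(b^n) - Q_P(b^n)$ is non-negative for all $b^n$. Also, we can bound $-\log P^n_{d,\epsilon}(b^n)$ using Jensen's inequality:

\begin{align}
-\log P^n_{d,\epsilon}(b^n) &= -\log \sum_{a^n} P^n_{d,\epsilon}(a^n)P(b^n \mid a^n) \tag{182} \\
&\le -\sum_{a^n} P^n_{d,\epsilon}(a^n)\log\left(\frac{1}{(2\pi)^{n/2}}\exp\left(-\tfrac{1}{2}\|b^n - a^n\|_2^2\right)\right) \tag{183} \\
&= n\log(\sqrt{2\pi}) + \frac{1}{2}\sum_{a^n} P^n_{d,\epsilon}(a^n)\|b^n - a^n\|_2^2. \tag{184}
\end{align}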

Therefore, 

[7(A";B")]p^_ - sup 

- ~ (r^^^'"^^''"^ ~ ^''^^"^) - log(l -P^"^) (185) 

< (^^-^P,^,r)-Qp(6")j |^nlog(x/2^) + ^E^^"^(«")ll^"-«"ll2)'^^" + ^i"^ (186) 

= 1 /„ E (77^^-^"^^^") - ^^(^")) - «"llirf?>" + ^i"^ + ^^^nlog(x^) (187) 

/ E i^^PSAb'') - ) P2A<^nim\l + \K\\l)db" + ^i"^ + 4"^ (188) 

^(Ep.^ +Ep.^ [P«||2])-(Eq^ +Ep.^ [p«||2])+<5(")+4"^ (189) 



< 

Jb' 

1 



1 (") 
1 - 



i^(2Epn^ [\\A-\\l] +n)-(EQ^ +n + Epn^ ) + + (190) 



-■ . (n) (ri) 
. TIT- , riM"ll2l _ TI?„ ni zl"l|2l , _L _L 



1-p. 

=l±^EpnJp"||^] -EqJ||A"||^] + +<5r + (191) 

/ 1 \ (") 

= E — - + + <5(") +4"^ +4"^ (192) 



1 (")" 



= E f ^T^^d^.K)! Il«"ll2l(a" e ^i"))da'^ + 4"^ + 4"^ + 4"^ + 4"^ (194) 
(195)



where (5 



■(") j(' 
1 > "2 ) O3 , 04 



are defined as 




log(l-p(")) 

(n) 

^^nlog(v^) 



(196) 




(197) 



(198) 



(199) 



which are vanishing as $n$ grows to infinity. Note that $\|A^n\|_2^2\,\mathbf{1}(A^n \in \mathcal{N}_\epsilon^{(n)})$ converges to zero with probability 1 by the strong law of large numbers, so the expectation also converges to zero. By the continuity of mutual information, we can finally conclude that $[I(A^n; B^n)]_{P_{A^n} = P_d^n} - \sup_{w \in \mu(\Theta)} I(A^n; B^n)$ converges to zero as $n$ goes to infinity. ■

References

[1] T. Duncan, "On the Calculation of Mutual Information," SIAM Journal on Applied Mathematics, vol. 19, no. 1, pp. 215-220, 1970.
[2] T. Weissman, "The Relationship Between Causal and Noncausal Mismatched Estimation in Continuous-Time AWGN Channels," Information Theory, IEEE Transactions on, vol. 56, no. 9, pp. 4256-4273, Sep. 2010.
[3] R. Atar, T. Weissman, "Mutual Information, Relative Entropy, and Estimation in the Poisson Channel," Information Theory, IEEE Transactions on, vol. 58, no. 3, pp. 1302-1318, Mar. 2012.
[4] N. Merhav, M. Feder, "A strong version of the redundancy-capacity theorem of universal coding," Information Theory, IEEE Transactions on, vol. 41, no. 3, pp. 714-722, May 1995.
[5] J. Rissanen, "Universal coding, information, prediction, and estimation," Information Theory, IEEE Transactions on, vol. 30, no. 4, pp. 629-636, July 1984.
[6] T. Weissman, Y.-H. Kim, and H. Permuter, "Directed information, causal estimation, and communication in continuous time," Information Theory, IEEE Transactions on, vol. PP, no. 99, p. 1, Nov. 2012.
[7] A. Banerjee, X. Guo, and H. Wang, "On the optimality of conditional expectation as a Bregman predictor," Information Theory, IEEE Transactions on, vol. 51, no. 7, pp. 2664-2669, July 2005.
[8] R. G. Gallager, "Source coding with side information and universal coding," Tech. Rep. LIDS-P-937, Lab. Inform. Decision Syst., 1979.
[9] L. Zhang, D. Guo, "Capacity of Gaussian channels with duty cycle and power constraints," in Information Theory Proceedings (ISIT), 2011 IEEE International Symposium on, Aug. 2011, pp. 513-517.
[10] S. Shamai, "On the capacity of a direct-detection photon channel with intertransition-constrained binary input," Information Theory, IEEE Transactions on, vol. 37, no. 6, pp. 1540-1550, Nov. 1991.
[11] J. Bento, M. Ibrahimi, and A. Montanari, "Information theoretic limits on learning stochastic differential equations," in Information Theory Proceedings (ISIT), 2011 IEEE International Symposium on, Aug. 2011, pp. 855-859.
[12] E. Lehmann, G. Casella, "Theory of point estimation," Springer, 1998, vol. 31.